Spark: dynamic schema definition out of csv file - scala

I'm receiving schema information as a csv file, shown below. Assume I have around 100+ columns:
FIRSTNAME|VARCHAR2
LASTNAME|VARCHAR2
MIDDLENAME|VARCHAR2
BIRTHDATE|DATE
ADULTS|NUMBER
ADDRESS|VARCHAR2
How can I generate a schema dynamically in Spark in this scenario?

You can use string splitting and pattern matching, provided the schema file is validly formatted. Assuming you already have the schema loaded as a single comma-separated string, the following will work:
import org.apache.spark.sql.types._

def toSchema(str: String): StructType = {
  val structFields = str.split(",").map { s =>
    val split = s.split("\\|")
    val name: String = split.head
    val typeStr = split.tail.head
    val varCharPattern = "varchar[0-9]+".r
    val datePattern = "date".r
    val numPattern = "number".r
    val t = typeStr.toLowerCase match {
      case varCharPattern() => StringType
      case datePattern() => TimestampType
      case numPattern() => IntegerType
      case _ => throw new Exception(s"unknown type string: $typeStr")
    }
    StructField(name, t)
  }
  StructType(structFields)
}
You can add more types easily enough by just adding new cases to the pattern matching statement.
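For example, if the schema arrives as the pipe-delimited NAME|TYPE lines shown in the question, a minimal sketch of wiring it together might look like this (the file paths are placeholders, and it assumes the data file itself is also pipe-delimited):
// Read the schema file (one NAME|TYPE entry per line) and join the entries
// with commas to match the input format toSchema expects.
val schemaLines = scala.io.Source.fromFile("/path/to/schema.csv").getLines().toSeq
val schema = toSchema(schemaLines.mkString(","))

// Use the generated schema when reading the actual data file.
val df = spark.read
  .schema(schema)
  .option("delimiter", "|")
  .csv("/path/to/data.csv")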

Writing null values to Parquet in Spark when the NullType is inside a StructType

I'm importing a collection from MongoDB to Spark. All the documents have a field 'data', which in turn is a structure with a field 'configurationName' (which is always null).
val partitionDF = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("database", "db").option("collection", collectionName).load()
For the data column in the resulting DataFrame, I get this type:
StructType(StructField(configurationName,NullType,true), ...
When I try to save the dataframe as Parquet
partitionDF.write.mode("overwrite").parquet(collectionName + ".parquet")
I get the following error:
AnalysisException: Parquet data source does not support struct<configurationName:null, ...
It looks like the problem is that I have that NullType buried in the data column's type. I'm looking at How to handle null values when writing to parquet from Spark, but it only shows how to solve this NullType problem for top-level columns.
But how do you solve this problem when a NullType is not at the top level? The only idea I have so far is to flatten the dataframe completely (exploding arrays and so on), so that all the NullTypes would pop up at the top level. But in that case I would lose the original structure of the data (which I don't want to lose).
Is there a better solution?
@Roman Puchkovskiy: I've rewritten your function using pattern matching:
def deNullifyStruct(struct: StructType): StructType = {
  val items = struct.map { field =>
    StructField(field.name, fixNullType(field.dataType), field.nullable, field.metadata)
  }
  StructType(items)
}

def fixNullType(dt: DataType): DataType = {
  dt match {
    case _: StructType => deNullifyStruct(dt.asInstanceOf[StructType])
    case _: ArrayType =>
      val array = dt.asInstanceOf[ArrayType]
      ArrayType(fixNullType(array.elementType), array.containsNull)
    case _: NullType => StringType
    case _ => dt
  }
}
Building on How to handle null values when writing to parquet from Spark and How to pass schema to create a new Dataframe from existing Dataframe? (the second was suggested by @pasha701, thanks!), I constructed this:
def denullifyStruct(struct: StructType): StructType = {
  val items = struct.map { field =>
    StructField(field.name, denullify(field.dataType), field.nullable, field.metadata)
  }
  StructType(items)
}

def denullify(dt: DataType): DataType = {
  dt match {
    case struct: StructType => denullifyStruct(struct)
    case array: ArrayType => ArrayType(denullify(array.elementType), array.containsNull)
    case _: NullType => StringType
    case _ => dt
  }
}
which effectively replaces all NullType instances with StringType ones.
And then
val fixedDF = spark.createDataFrame(partitionDF.rdd, denullifyStruct(partitionDF.schema))
fixedDF.printSchema
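Since the NullType columns contain only nulls, re-applying the schema with StringType in their place does not change any data, and the Parquet write from the question should then go through. A minimal sketch, reusing the question's own output path:
// The schema no longer contains NullType anywhere, so Parquet accepts it.
fixedDF.write.mode("overwrite").parquet(collectionName + ".parquet")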

Spark - Convert all Timestamp columns to a certain date format

I have a use case where I need to read data from Hive tables (Parquet), convert Timestamp columns to a certain format and write the output as csv.
For the date formatting, I want to write a function that takes a StructField and returns either the original field name, or date_format($"col_name", "dd-MMM-yyyy hh.mm.ss a") if the dataType is TimestampType. This is what I have come up with so far:
def main(args: Array[String]): Unit = {
  val hiveSchema = args(0)
  val hiveTable = args(1)
  val myDF = spark.table(s"${hiveSchema}.${hiveTable}")
  val colArray = myDF.schema.fields.map(getColumns)
  val colString = colArray.mkString(",")
  myDF.select(colString).write.format("csv").mode("overwrite").option("header", "true").save("/tmp/myDF")
}

def getColumns(structField: StructField): String = structField match {
  case structField if structField.dataType.simpleString.equalsIgnoreCase("timestamp") =>
    s"""date_format($$"${structField.name}", "dd-MMM-yy hh.mm.ss a")"""
  case _ => structField.name
}
But I get the following error at runtime
org.apache.spark.sql.AnalysisException: cannot resolve '`date_format($$"my_date_col", "dd-MMM-yy hh.mm.ss a")`' given input columns [mySchema.myTable.first_name, mySchema.myTable.my_date_col];
Is there a better way to do this?
Remove the double dollar sign and quotes. Also, no need to mkString; just use selectExpr:
def main(args: Array[String]): Unit = {
  val hiveSchema = args(0)
  val hiveTable = args(1)
  val myDF = spark.table(s"${hiveSchema}.${hiveTable}")
  val colArray = myDF.schema.fields.map(getColumns)
  myDF.selectExpr(colArray: _*).write.format("csv").mode("overwrite").option("header", "true").save("/tmp/myDF")
}

def getColumns(structField: StructField): String = structField match {
  case structField if structField.dataType.simpleString.equalsIgnoreCase("timestamp") =>
    s"""date_format(${structField.name}, "dd-MMM-yy hh.mm.ss a") as ${structField.name}"""
  case _ => structField.name
}

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any], the keys are currently Strings (RDD[String]), and they mainly contain Maps. I would like to make them of type Row so I can eventually make a dataframe. Here is my rdd:
removed
Most of the rdd follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own rdd, especially reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert it to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}

val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a dataframe you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2: RDD[Array[String]] = rdd.map(x => getParsedRow(x))
val rddFinal: RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternatively:
case class MyData(....) // all the fields of the schema I want

object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert it to the case class
  }
}

val rddFinal: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
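On the edit's question about matching on a Map: inside MyDataBuilder.apply you can match on the Map type directly (element types are erased at runtime, so match on Map[_, _] and inspect the values afterwards). A small sketch; the keys used here are made up, since the actual RDD contents were removed from the question:
// Illustrative only: "type", "libVersion" and "id" are assumed key names.
def fromAny(s: Any): Option[MyData] = s match {
  case m: Map[_, _] =>
    val map = m.asInstanceOf[Map[String, Any]]
    for {
      t  <- map.get("type").collect { case l: List[_] => l.map(_.toString) }
      lv <- map.get("libVersion").collect { case d: Double => d }
      id <- map.get("id").collect { case b: BigInt => b }
    } yield MyData(t, lv, id)
  case _ => None
}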
There is a method for converting an rdd to a dataframe; use it like below:
val rdd = sc.textFile("/pathtologfile/logfile.txt")
import spark.implicits._
val df = rdd.toDF()
Now you have a dataframe; do whatever you want with it, using queries like below:
import org.apache.spark.sql.functions.col

val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Add scoped variable per row iteration in Apache Spark

I'm reading multiple html files into a dataframe in Spark.
I'm converting elements of the html to columns in the dataframe using a custom udf
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call results in the html string being parsed again, which is redundant.
Is there a way (without using lookup tables or such) that I can generate one parsed Document (Jsoup.parse(html)) from the "filecontent" column per row and make it available to all withColumn calls on the dataframe?
Or shouldn't I even try using DataFrames, and just use RDDs instead?
So the final answer was in fact quite simple:
Just map over the rows and create the object once there:
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)

      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
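DataEntry, docJson, fileName and fileTimestamp are the poster's own helpers and are not shown. For orientation only, a hypothetical DataEntry covering just the fields visible above might look like this (the field types are assumptions):
// Hypothetical sketch: only the fields visible in the snippet above; the real
// DataEntry has more columns, and the timestamp type is an assumption.
case class DataEntry(
  biz_name: Option[String],
  biz_website: Option[String],
  url: Option[String],
  filename: Option[String],
  fileTimestamp: Option[Long]
)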
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")

def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })
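A variant of the same idea, if positional indexes like col("temp")(0) feel brittle, is to have the UDF return a map keyed by selector so each column is read by name. A rough sketch under the same assumptions (htmlDF is just a placeholder name for the DataFrame with the "filecontent" column):
import org.apache.spark.sql.functions.{col, udf}

// The UDF parses the html once and returns selector -> text for every query.
def parseDocValues(cssSelectorQueries: Seq[String]) =
  udf((html: String) => {
    val doc = Jsoup.parse(html)
    cssSelectorQueries.map(q => q -> doc.select(q).text()).toMap
  })

val queries = Seq(".biz-page-title", ".biz-website a")
val htmlDF = spark.sparkContext.wholeTextFiles(inputPath).toDF("filepath", "filecontent")

val parsed = htmlDF
  .withColumn("temp", parseDocValues(queries)(col("filecontent")))
  .withColumn("biz_name", col("temp")(".biz-page-title"))
  .withColumn("biz_website", col("temp")(".biz-website a"))
  .drop("temp")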

Scala ElasticSearch indexing in relation to dynamically changing schema

I found the code below on CERN's website.
FYI: I am using Spark 1.3.
The example code is fantastic when you know the schema of the dataset you want to index into Elasticsearch.
However, could somebody point me in the right direction so that I can create a method that does the following:
Pass in, as arguments, the schema structure from an external source (column name / datatype) (the hard bit), along with the file name to be indexed (the easy bit).
Perform the schema mappings inside the function dynamically.
Having a method like this would allow me to generate a mapped and indexed dataset in ES.
Example Code:
//import elasticsearch packages
import org.elasticsearch.spark._
//define the schema
case class MemT(dt: String, server: String, memoryused: Integer)
//load the csv file into rdd
val Memcsv = sc.textFile("/tmp/flume_memusage.csv")
//split the fields, trim and map it to the schema
val MemTrdd = Memcsv.map(line=>line.split(",")).map(line=>MemT(line(0).trim.toString,line(1).trim.toString,line(2).trim.toInt))
//write the rdd to the elasticsearch
MemTrdd.saveToEs("fmem/logs")
Thank you!
source:
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
What I wanted to achieve was to be able to index directly into ES from a DataFrame.
I required that the index mappings be driven from an external schema source. Here is how I achieved that...
BTW: There is additional validation/processing that I have omitted, but this skeleton code should get those with similar needs going...
I have included the following ES dependency in my build.sbt file
"org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.3"
Comments welcome...
//Just showing the ES stuff
import org.elasticsearch.hadoop
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql.EsSparkSQL
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

//Declare a schema
val schemaString = "age:int,name:string,location:string"

//Load the csv file into an RDD
val rdd = sc.textFile("/path/to/your/file.csv")
val separator = "," //This is the separator used in the csv

//Convert the schema string above into a StructType
val schema =
  StructType(
    schemaString.split(",").map(fieldName =>
      StructField(fieldName.split(":")(0),
        getFieldTypeInSchema(fieldName.split(":")(1)), true)))

//Map each line of the RDD to a Row with elements converted to the schema's types
val rowRDDx = rdd.map(p => {
  var list: collection.mutable.Seq[Any] = collection.mutable.Seq.empty[Any]
  var index = 0
  var tokens = p.split(separator)
  tokens.foreach(value => {
    var valType = schema.fields(index).dataType
    var returnVal: Any = null
    valType match {
      case IntegerType => returnVal = value.toString.toInt
      case DoubleType => returnVal = value.toString.toDouble
      case LongType => returnVal = value.toString.toLong
      case FloatType => returnVal = value.toString.toFloat
      case ByteType => returnVal = value.toString.toByte
      case StringType => returnVal = value.toString
      case TimestampType => returnVal = value.toString
    }
    list = list :+ returnVal
    index += 1
  })
  Row.fromSeq(list)
})

//Convert the RDD of Rows to a DataFrame, also specifying the intended schema
val df = sqlContext.createDataFrame(rowRDDx, schema)

//Index the DataFrame to ES
EsSparkSQL.saveToEs(df, "test/doc")
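getFieldTypeInSchema is one of the helpers the poster omitted. A minimal sketch of what it presumably does, mapping the type tokens used in schemaString to Spark DataTypes (the exact set of tokens is an assumption):
import org.apache.spark.sql.types._

// Hypothetical reconstruction of the omitted helper: maps a type token from the
// schema string (e.g. "int", "string") to the corresponding Spark DataType.
def getFieldTypeInSchema(fieldType: String): DataType = fieldType.trim.toLowerCase match {
  case "int"       => IntegerType
  case "long"      => LongType
  case "double"    => DoubleType
  case "float"     => FloatType
  case "byte"      => ByteType
  case "timestamp" => TimestampType
  case "string"    => StringType
  case _           => StringType // fall back to string for unknown tokens
}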