Infer Schema from RDD to DataFrame in Spark Scala

This question is a follow-up to (Spark - creating schema programmatically with different data types).
I am trying to infer a schema from an RDD and build a DataFrame from it. Below is my code:
def inferType(field: String) = field.split(":")(1) match {
  case "Integer" => IntegerType
  case "Double" => DoubleType
  case "String" => StringType
  case "Timestamp" => TimestampType
  case "Date" => DateType
  case "Long" => LongType
  case _ => StringType
}
val header = "c1:String|c2:String|c3:Double|c4:Integer|c5:String|c6:Timestamp|c7:Long|c8:Date"
val df1 = Seq(("a|b|44.44|5|c|2018-01-01 01:00:00|456|2018-01-01")).toDF("data")
val rdd1 = df1.rdd.map(x => Row(x.getString(0).split("\\|"): _*))
val schema = StructType(header.split("\\|").map(column => StructField(column.split(":")(0), inferType(column), true)))
val df = spark.createDataFrame(rdd1, schema)
df.show()
When I call show, it throws the error below. I have to perform this operation on larger-scale data and am having trouble finding the right solution. Can anybody please help me find a solution for this, or any other way I can achieve it?
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
Thanks in advance

Short answer: createDataFrame does not cast values for you; every entry in a Row must already have the external type that its schema field expects, so a split that only produces strings cannot populate Integer or Double columns.
What you are trying to do is parse a delimited string into typed SQL columns. Unlike the CSV-loading example, the split leaves every value as a String, which is why the encoder rejects the Integer field.
A working version can be achieved like this:
// skipped other details such as imports (org.apache.spark.rdd.RDD, org.apache.spark.sql.Row,
// org.apache.spark.sql.types._), the inferType function above, spark session...
val header = "c1:String|c2:String|c3:Double|c4:Integer"
// Create `Row` from `Seq`
val row = Row.fromSeq(Seq("a|b|44.44|12|"))
// Create `RDD` of `Row`s, converting each token to the type the schema expects
val rdd: RDD[Row] = spark.sparkContext
  .makeRDD(List(row))
  .map { row =>
    row.getString(0).split("\\|") match {
      case Array(col1, col2, col3, col4) =>
        // build the Row from already-typed Scala values
        Row.fromTuple((col1, col2, col3.toDouble, col4.toInt))
    }
  }
val stt: StructType = StructType(
  header
    .split("\\|")
    .map(column => StructField(column.split(":")(0), inferType(column), true))
)
val dataFrame = spark.createDataFrame(rdd, stt)
dataFrame.show()
The reason for building the Row from native Scala values (Double, Int, ...) is that createDataFrame only accepts values whose types are compatible with the schema; it will not convert strings for you.
Note that I skipped the date- and time-related fields; date conversions are tricky. You can check my other answer on how to use formatted dates and timestamps here.
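For wider schemas, hard-coding the tuple arity does not scale. Below is a minimal sketch of a generic converter driven by the inferred schema; it is my own extension of the answer, assumes every raw value arrives as a non-null string, and leaves out the tricky date and timestamp cases mentioned above.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Convert one raw string token into the external type its schema field expects.
def convert(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType => raw.toInt
  case LongType    => raw.toLong
  case DoubleType  => raw.toDouble
  case _           => raw // StringType and anything unhandled stays a string
}
// Build a Row whose values already match the schema, field by field.
def toTypedRow(tokens: Array[String], schema: StructType): Row =
  Row.fromSeq(tokens.zip(schema.fields).map { case (t, f) => convert(t, f.dataType) }.toSeq)
// Usage with the question's data: rows split on "|" become typed Rows, e.g.
// df1.rdd.map(x => toTypedRow(x.getString(0).split("\\|"), schema))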

Related

Spark cast all LongType columns to IntegerType dynamically

I don't have schema information because I have different tables, and the DataFrame will be created from their data. So I want to detect any LongType column and cast it to IntegerType.
My approach is to create a new DataFrame with a new schema in which the LongType fields are converted to IntegerType.
val df = spark.read.format("bigquery").load(sql)
// cast long types to int
val newSchemaArr = df.schema.fields.map(f => if(f.dataType.isInstanceOf[LongType]) StructField(name = f.name, dataType = IntegerType, nullable = f.nullable) else f)
val newSchema = new StructType(newSchemaArr)
val df2 = spark.createDataFrame(df.rdd, newSchema)
// write to hdfs files
df2.write.format("avro").save(destinationPath)
But I got this error when writing the data.
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of int
Is there any way to fix this, or another approach to handle the problem?
Spark version: 3.2.0
Scala version: 2.12
The easiest way to do this is to simply cast columns when necessary:
// cast long columns to integer
val columns = df.schema.map {
  case StructField(name, LongType, _, _) => col(name).cast(IntegerType)
  case f => col(f.name)
}
// write modified columns
df.select(columns: _*).write.format("avro").save(destinationPath)
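If you want to sanity-check the cast before writing, a small usage sketch (my own addition, reusing the columns defined above):
val casted = df.select(columns: _*)
casted.printSchema() // former long columns should now report integer
casted.write.format("avro").save(destinationPath)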
Alternatively, fold over the schema and cast each matching column:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType}

val sch = df.schema
val df2 = sch.fieldNames.foldLeft(df) { (tmpDF, colName) =>
  if (tmpDF.schema(colName).dataType == LongType)
    tmpDF.withColumn(colName, col(colName).cast(IntegerType))
  else tmpDF
}

How to define schema of streaming dataset dynamically to write to csv?

I have a streaming Dataset, read from Kafka, and I am trying to write it to CSV.
case class Event(map: Map[String,String])
def decodeEvent(arrByte: Array[Byte]): Event = ...//some implementation
val eventDataset: Dataset[Event] = spark
  .readStream
  .format("kafka")
  .load()
  .select("value")
  .as[Array[Byte]]
  .map(decodeEvent)
Event holds a Map[String,String] inside, and to write to CSV I'll need some schema.
Let's say all the fields are of type String, so I tried the example from the Spark repo:
val columns = List("year","month","date","topic","field1","field2")
// Prepare schema programmatically (StructType is immutable, so fold instead of foreach)
val schema = columns.foldLeft(new StructType())((s, field) => s.add(field, "string"))
val rowRdd = eventDataset.rdd.map { event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
}
val df = spark.sqlContext.createDataFrame(rowRdd, schema)
This gives an error at runtime on the line "eventDataset.rdd":
Caused by: org.apache.spark.sql.AnalysisException: Queries with
streaming sources must be executed with writeStream.start();;
The below doesn't work either, because '.map' produces a List[String], not a Tuple:
eventDataset.map(event => columns.map(c => event.map.getOrElse(c, "")))
  .toDF(columns: _*)
Is there a way to achieve this with programmatic schema and structured streaming datasets?
I'd use a much simpler approach:
import org.apache.spark.sql.functions._

eventDataset.select(columns.map(
  c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*).writeStream.format("csv").start(path)
But if you want something closer to your current solution, skip the RDD conversion and map the Dataset directly with an explicit RowEncoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder

eventDataset.map(event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
)(RowEncoder(schema)).writeStream.format("csv").start(path)
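Note that the file sink in Structured Streaming also needs a checkpoint location. A small usage sketch of the first approach with that option (checkpointPath is a placeholder of my own, not from the original answer):
eventDataset.select(columns.map(
  c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*)
  .writeStream
  .format("csv")
  .option("checkpointLocation", checkpointPath) // required for file sinks
  .start(path)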

Changing columns that are string in Spark GraphFrame

I'm using GraphFrame in Spark 2.0 and Scala.
I need to remove double quotes from the columns that are of string type (out of many columns).
I'm trying to do so using UDF as follow:
import org.apache.spark.sql.functions.udf
val removeDoubleQuotes = udf( (x: Any) =>
  x match {
    case s: String => s.replace("\"", "")
    case other => other
  }
)
And I get the following error, since type Any is not supported by Spark SQL UDFs:
java.lang.UnsupportedOperationException: Schema for type Any is not
supported
What is a workaround for that?
You can't have a column of type Any, and a UDF can't return different data types; it has to have a single return type.
If your column is a String, then you can create the UDF as:
import org.apache.spark.sql.functions.udf

val removeDoubleQuotes = udf( (x: String) => x.replace("\"", "") )
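If you need to strip the quotes from every string column without listing them by hand, here is a minimal sketch; it is my own extension of the answer and assumes a plain DataFrame df (for example the vertices or edges DataFrame of the GraphFrame):
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.StringType

val removeDoubleQuotes = udf((s: String) => if (s == null) null else s.replace("\"", ""))

// Apply the UDF only to StringType columns; leave every other column unchanged.
val cleaned = df.select(df.schema.fields.map { f =>
  if (f.dataType == StringType) removeDoubleQuotes(col(f.name)).alias(f.name)
  else col(f.name)
}: _*)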

Spark, Scala - column type determine

I load data from a database and do some processing on it.
The problem is that some tables have the date column as 'String', while others treat it as 'timestamp'.
I cannot know what type the date column is until the data is loaded.
> x.getAs[String]("date") // could be error when date column is timestamp type
> x.getAs[Timestamp]("date") // could be error when date column is string type
This is how I load the data in Spark:
spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .load()
Is there any way to treat them uniformly, or to always convert the column to a string?
You can pattern-match on the type of the column (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as is - and use the unix_timestamp function to do the actual conversion:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
  ("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
  ("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")

// If the column is a String, convert it to Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
  df.schema("date").dataType match {
    case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
    case _ => df
  }
}

// after "normalizing", you can assume date has Timestamp type -
// both would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
Here are a few things you can try:
(1) Use the inferSchema option during the load if your version supports it. This has Spark figure out the data types of the columns, although it doesn't work in all scenarios. Also look at the input data: if the values are quoted, I advise adding extra options to account for the quotes during the load.
val inputDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load(fileLocation)
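For quoted input, a minimal sketch of what those extra options could look like (the quote and escape characters below are assumptions; adjust them to your data):
val inputDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")  // character used to quote values
  .option("escape", "\"") // character used to escape quotes inside quoted values
  .load(fileLocation)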
(2) To identify the data type of a column, you can use the code below; it places all of the column names and data types into their own Arrays of Strings.
val columnNames : Array[String] = inputDF.columns
val columnDataTypes : Array[String] = inputDF.schema.fields.map(x=>x.dataType).map(x=>x.toString)
Row also offers an easy way to address this: get(i: Int): Any. It maps between Spark SQL types and Scala return types automatically, e.g.:
val fieldIndex = row.fieldIndex("date")
val date = row.get(fieldIndex)
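Building on that, a small sketch (my own addition) of branching on the runtime type of the returned value:
import java.sql.Timestamp

val fieldIndex = row.fieldIndex("date")
row.get(fieldIndex) match {
  case s: String    => println(s"date stored as string: $s")
  case t: Timestamp => println(s"date stored as timestamp: $t")
  case other        => println(s"unexpected type: $other")
}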
The same schema pattern-matching idea also works when a column may be either a plain string or a nested struct, for example a location column:
def parseLocationColumn(df: DataFrame): DataFrame = {
  df.schema("location").dataType match {
    case StringType => df.withColumn("locationTemp", $"location")
      .withColumn("countryTemp", lit("Unknown"))
      .withColumn("regionTemp", lit("Unknown"))
      .withColumn("zoneTemp", lit("Unknown"))
    case _ => df.withColumn("locationTemp", $"location.location")
      .withColumn("countryTemp", $"location.country")
      .withColumn("regionTemp", $"location.region")
      .withColumn("zoneTemp", $"location.zone")
  }
}

Spark SQL: automatic schema from csv

Does Spark SQL provide any way to automatically load CSV data?
I found the following Jira: https://issues.apache.org/jira/browse/SPARK-2360, but it was closed.
Currently I would load a CSV file as follows:
case class Record(id: String, val1: String, val2: String, ....)

sc.textFile("Data.csv")
  .map(_.split(","))
  .map { r =>
    Record(r(0), r(1), .....)
  }.registerAsTable("table1")
Any hints on the automatic schema deduction from csv files? In particular a) how can I generate a class representing the schema and b) how can I automatically fill it (i.e. Record(r(0),r(1), .....))?
Update:
I found a partial answer to the schema generation here:
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#data-sources
// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
So the only question left is how to do the step
map(p => Row(p(0), p(1).trim)) dynamically for a given number of attributes?
Thanks for your support!
Joerg
You can use spark-csv, which saves a few keystrokes: you don't have to define the column names, and it can pick them up from the headers automatically.
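A minimal sketch of such a load, assuming the spark-csv package (com.databricks.spark.csv, for the Spark 1.x era this question targets) is available:
// header: take column names from the first line; inferSchema: guess the column types
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("Data.csv")
df.registerTempTable("table1")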
val schemaString = "name age".split(" ")
// Generate the schema based on the string of schema
val schema = StructType(schemaString.map(fieldName => StructField(fieldName, StringType, true)))
val lines = people.flatMap(x=> x.split("\n"))
val rowRDD = lines.map(line=>{
Row.fromSeq(line.split(" "))
})
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
Maybe this link will help you:
http://devslogics.blogspot.in/2014/11/spark-sql-automatic-schema-from-csv.html