Dynamic schema generation from array of columns - scala

I have a list of columns. Using these columns, I prepared a schema with the following code:
import org.apache.spark.sql.types._
val fields = Array("col1", "col2", "col3", "col4", "col5", "col6")
val dynSchema = StructType(fields.map(field =>
  new StructField(field, StringType, true, null)))
The schema then comes out as:
StructType(StructField(col1,StringType,true), StructField(col2,StringType,true),
StructField(col3,StringType,true), StructField(col4,StringType,true),
StructField(col5,StringType,true), StructField(col6,StringType,true))
But I am getting a NullPointerException when I try to read the data from a JSON file using the above schema.
// reading the data
spark.read.schema(dynSchema).json("./file/path/*.json")
But it works if I add an Array to the StructType.
Please help me generate the schema dynamically.
Edit: If I create the schema with the above fields as follows, I am able to read the data from the JSON.
StructType(Array(
  StructField("col1", StringType, true), StructField("col2", StringType, true),
  StructField("col3", StringType, true), StructField("col4", StringType, true),
  StructField("col5", StringType, true), StructField("col6", StringType, true)))

Simply remove the null argument from the creation of the StructField as follows:
val dynSchema = StructType(fields.map(field =>
  new StructField(field, StringType, true)))
The last argument defines metadata for the column. Its default value is not null but Metadata.empty (see the source code for details). The source code assumes the metadata can never be null and calls methods on it without any checks, which is why you get a NullPointerException.
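If you do want to pass the fourth argument explicitly, a minimal equivalent sketch (the val name here is just for illustration) would be:
import org.apache.spark.sql.types.Metadata
val dynSchemaWithMeta = StructType(fields.map(field =>
  StructField(field, StringType, true, Metadata.empty)))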

Related

How to define a schema for json to be used in from_json to parse out values

I am trying to come up with a schema definition to parse information out of a dataframe string column; I am using from_json for that. I need help defining the schema, which I am somehow not getting right.
Here is the JSON I have:
[
  {
    "sectionid": "838096e332d4419191877a3fd40ed1f4",
    "sequence": 0,
    "questions": [
      {
        "xid": "urn:com.mheducation.openlearning:lms.assessment.author:qastg.global:assessment_item:2a0f52fb93954f4590ac88d90888be7b",
        "questionid": "d36e1d7eeeae459c8db75c7d2dfd6ac6",
        "quizquestionid": "d36e1d7eeeae459c8db75c7d2dfd6ac6",
        "qtype": "3",
        "sequence": 0,
        "subsectionsequence": -1,
        "type": "80",
        "question": "<p>This is a simple, 1 question assessment for automation testing</p>",
        "totalpoints": "5.0",
        "scoring": "1",
        "scoringrules": "{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
        "inputoption": "0",
        "casesensitive": "0",
        "suggestedscoring": "1",
        "suggestedscoringrules": "{\"type\":\"perfect\",\"points\":5.0,\"pointsEach\":null,\"rules\":[]}",
        "answers": [
          "1"
        ],
        "options": []
      }
    ]
  }
]
I want to parse this information out, which will result in the columns sectionid, sequence, xid, question.sequence, question.question (the question text), and answers.
Here is what I have. I have defined a schema for testing like this:
import org.apache.spark.sql.types.{StringType, ArrayType, StructType, StructField}
val schema = new StructType()
  .add("sectionid", StringType, true)
  .add("sequence", StringType, true)
  .add("questions", StringType, true)
  .add("answers", StringType, true)
finalDF = finalDF
  .withColumn("parsed", from_json(col("enriched_payload.transformed"), schema))
But I am getting NULL in the result columns. The reason, I think, is that my schema is not right.
I am struggling to come up with the right definition. How do I come up with the correct JSON schema definition?
I am using Spark 3.0.
Try the code below.
import org.apache.spark.sql.types._
val schema = ArrayType(
  new StructType()
    .add("sectionid", StringType, true)
    .add("sequence", LongType, true)
    .add("questions", ArrayType(
      new StructType()
        .add("answers", ArrayType(StringType, true), true)
        .add("casesensitive", StringType, true)
        .add("inputoption", StringType, true)
        .add("options", ArrayType(StringType, true), true)
        .add("qtype", StringType, true)
        .add("question", StringType, true)
        .add("questionid", StringType, true)
        .add("quizquestionid", StringType, true)
        .add("scoring", StringType, true)
        .add("scoringrules", StringType, true)
        .add("sequence", LongType, true)
        .add("subsectionsequence", LongType, true)
        .add("suggestedscoring", StringType, true)
        .add("suggestedscoringrules", StringType, true)
        .add("totalpoints", StringType, true)
        .add("type", StringType, true)
        .add("xid", StringType, true)
    ))
)
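To get the columns listed in the question out of this schema, one possible approach is to parse the payload and then explode the nested arrays. A hedged usage sketch, assuming the dataframe and column names from the question (finalDF and enriched_payload.transformed):
import org.apache.spark.sql.functions.{col, explode, from_json}
val parsed = finalDF
  .withColumn("parsed", from_json(col("enriched_payload.transformed"), schema))
  .withColumn("section", explode(col("parsed")))      // one row per section
  .withColumn("q", explode(col("section.questions"))) // one row per question
  .select(
    col("section.sectionid"),
    col("section.sequence"),
    col("q.xid"),
    col("q.sequence").as("question_sequence"),
    col("q.question"),
    col("q.answers"))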

Spark Scala read custom file format to dataframe with schema

I have an input file which looks much like CSV but with a custom header:
FIELDS-START
field1
field2
field3
FIELDS-END-DATA-START
val1,2,3
val2,4,5
DATA-END
Task:
Read the data into a typed dataframe. The schema is obtained dynamically; for this specific file it would be, for example:
val schema = StructType(
  StructField("field1", StringType, true) ::
  StructField("field2", IntegerType, true) ::
  StructField("field3", IntegerType, true) :: Nil
)
So because of the custom header I can't use the Spark CSV reader. Another thing I tried:
val file = spark.sparkContext.textFile(...)
val data: RDD[List[String]] = file.filter(_.contains(",")).map(_.split(',').toList)
val df: DataFrame = spark.sqlContext.createDataFrame(data.map(Row.fromSeq(_)), schema)
It fails with a runtime exception:
java.lang.String is not a valid external type for schema of int
This is because createDataFrame doesn't do any casting.
NOTE: Schema is obtained at runtime
Thanks in advance!
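A minimal sketch of one way around this, assuming the split values line up positionally with the runtime schema's fields (the type mapping below is deliberately small and purely illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, IntegerType}

// Cast each raw string according to the corresponding field's data type before building Rows.
def cast(value: String, dataType: DataType): Any = dataType match {
  case IntegerType => value.toInt
  case _           => value // leave StringType (and anything unhandled) as-is in this sketch
}

val rows = data.map(values =>
  Row.fromSeq(values.zip(schema.fields).map { case (v, f) => cast(v, f.dataType) }))
val typedDf = spark.createDataFrame(rows, schema)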

Dynamically build case class or schema

Given a list of strings, is there a way to create a case class or a schema without inputting the strings manually?
For example, I have a List:
val name_list=Seq("Bob", "Mike", "Tim")
The List will not always be the same. Sometimes it will contain different names and will vary in size.
I can create a case class:
case class names(Bob: Integer, Mike: Integer, Tim: Integer)
or a schema:
val schema = StructType(StructField("Bob", IntegerType, true) ::
  StructField("Mike", IntegerType, true) ::
  StructField("Tim", IntegerType, true) :: Nil)
but I have to do it manually. I am looking for a method to perform this operation dynamically.
Assuming the data types of the columns are all the same:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val nameList=Seq("Bob", "Mike", "Tim")
val schema = StructType(nameList.map(n => StructField(n, IntegerType, true)))
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(Bob,IntegerType,true), StructField(Mike,IntegerType,true), StructField(Tim,IntegerType,true)
// )
spark.createDataFrame(rdd, schema)
If the data types are different, you'll have to provide them as well (in which case it might not save much time compared with assembling the schema manually):
val typeList = Array[DataType](StringType, IntegerType, DoubleType)
val colSpec = nameList zip typeList
val schema = StructType(colSpec.map(cs => StructField(cs._1, cs._2, true)))
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(Bob,StringType,true), StructField(Mike,IntegerType,true), StructField(Tim,DoubleType,true)
// )
If all the fields have the same data type, then you can simply create the schema as:
val name_list=Seq("Bob", "Mike", "Tim")
val fields = name_list.map(name => StructField(name, IntegerType, true))
val schema = StructType(fields)
If you have different data types, then create a mapping from field names to types and build the schema the same way, as sketched below.
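A hedged sketch of that approach; the names and types below are illustrative, and an ordered sequence of name-to-type pairs is used instead of an unordered Map so that column order is preserved:
import org.apache.spark.sql.types._
val fieldTypes = Seq("Bob" -> StringType, "Mike" -> IntegerType, "Tim" -> DoubleType)
val schema = StructType(fieldTypes.map { case (name, dt) => StructField(name, dt, true) })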
Hope this helps!
All the answers above only cover one aspect, which is creating the schema. Here is one solution you can use to create a case class from the generated schema:
https://gist.github.com/yoyama/ce83f688717719fc8ca145c3b3ff43fd
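In the same spirit as that gist, here is a minimal, hedged sketch that renders case-class source text from a schema (the Spark-to-Scala type mapping is deliberately incomplete and illustrative):
import org.apache.spark.sql.types._

def toCaseClass(className: String, schema: StructType): String = {
  val fields = schema.fields.map { f =>
    val scalaType = f.dataType match {
      case IntegerType => "Int"
      case LongType    => "Long"
      case DoubleType  => "Double"
      case _           => "String" // fall back to String for anything unhandled in this sketch
    }
    s"${f.name}: $scalaType"
  }
  s"case class $className(${fields.mkString(", ")})"
}

// e.g. toCaseClass("Names", schema) => "case class Names(Bob: Int, Mike: Int, Tim: Int)"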

How to refer a field name in data frame if we don't have schema

How can we refer to a particular field in a dataframe if we don't have a schema?
Can we refer to something like col1, col2, col3, etc. instead of names?
I have a CSV file like the one below.
arun|1001|hyd|x|y|z
suresh|1002|hyd|a|h|t
arun|1003|chn|e|g|e
suresh|1004|ban|r|f|w
How can I refer to the first field and filter the records based on the names, so that I can write them to separate files?
I want to write all arun records and all suresh records to separate files like below.
arun|1001|hyd|x|y|z
arun|1003|chn|e|g|e
and
suresh|1002|hyd|a|h|t
suresh|1004|ban|r|f|w
Three options:
Use the default column names (_c0, _c1, etc. in recent Spark versions)
Use the getAs() method on Row, e.g. in a filter or mapper
Programmatically specify the schema
In Scala, the schema is generated as:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._
val schema = StructType(
StructField("name", StringType, false) ::
StructField("intValue", IntegerType, false) :: Nil)
val df = sqlContext.createDataFrame(dfFromLoad.rdd, schema)
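A hedged sketch of the first two options, assuming a SparkSession named spark and purely illustrative file paths (col comes from org.apache.spark.sql.functions, imported above):
// Read the pipe-delimited file without a header; Spark names the columns _c0, _c1, ...
val df = spark.read.option("sep", "|").csv("/path/to/input.txt")
df.filter(col("_c0") === "arun").write.option("sep", "|").csv("/path/to/arun")
df.filter(col("_c0") === "suresh").write.option("sep", "|").csv("/path/to/suresh")

// Option 2, using getAs on Row (index 0 is the name field in this file):
val arunRdd = df.rdd.filter(_.getAs[String](0) == "arun")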

How can I get Spark SQL to import a Long without an "L" suffix?

I have a set of CSVs that I produced by Sqoop'ing a MySQL database. I am trying to define them as the source for a dataframe in Spark.
The schema in the source db contains several fields with a Long datatype, and actually stores giant numbers in those fields.
When trying to access the dataframe, Scala chokes on interpreting these because I do not have an L suffix on the long integers.
As an example, this throws an error: val test: Long = 20130102180600
While this succeeds: val test: Long = 20130102180600L
Is there any way to force Scala to interpret these as Long Integers without that suffix? Due to the scale of the data, I do not believe it is feasible to post-process the fields as they come out of the database.
Give the schema explicitly, as in the example in the spark-csv README:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
StructField("year", IntegerType, true),
StructField("make", StringType, true),
StructField("model", StringType, true),
StructField("comment", StringType, true),
StructField("blank", StringType, true)))
val df = sqlContext.load(
"com.databricks.spark.csv",
schema = customSchema,
Map("path" -> "cars.csv", "header" -> "true"))
val selectedData = df.select("year", "model")
selectedData.save("newcars.csv", "com.databricks.spark.csv")
Except using LongType for the large integer fields, of course.
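For instance, a hedged sketch using the built-in CSV reader of later Spark versions, assuming a SparkSession named spark; the column names and path are illustrative only, not taken from the actual data:
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}
val longSchema = StructType(Array(
  StructField("event_time", LongType, true),   // holds values like 20130102180600
  StructField("description", StringType, true)))
val df = spark.read.schema(longSchema).option("header", "true").csv("data.csv")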
Looking at the code, this definitely looks like it should work: fields are converted from String to the desired type using TypeCast.castTo, and TypeCast.castTo for LongType just calls datum.toLong, which works as desired (you can check "20130102180600".toLong in the Scala REPL). In fact, InferSchema handles this case as well. I strongly suspect that the issue is something different: perhaps the numbers are actually out of Long range?
(I haven't actually tried this, but I expect it to work; if it doesn't, you should report the bug. Start by reading https://stackoverflow.com/help/mcve.)