How can I lit an Option when converting from DataFrame to Dataset - Scala

So this is what I've been trying:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

val conf =
  new SparkConf().setMaster("local[*]").setAppName("test")
    .set("spark.ui.enabled", "false").set("spark.app.id", "testApp")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class B(s: String)
case class A(i: Int, b: Option[B])

val df = Seq(1,2,3).map(Tuple1.apply).toDF

// lit with a struct works fine
df.select(col("_1").as("i"), struct(lit("myString").as("s")).as("b")).as[A].show

/*
+---+-----------------+
| i| b|
+---+-----------------+
| 1|Some(B(myString))|
| 2|Some(B(myString))|
| 3|Some(B(myString))|
+---+-----------------+
*/

// lit with a null throws an exception
df.select(col("_1").as("i"), lit(null).as("b")).as[A].show
​
/*
org.apache.spark.sql.AnalysisException: Can't extract value from b#16;
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:475)
*/

Use the correct types:
import org.apache.spark.sql.types._
val s = StructType(Seq(StructField("s", StringType)))
df.select(col("_1").as("i"), lit(null).cast(s).alias("b")).as[A].show
lit(null) on its own is typed as NullType, so it won't match the expected struct type for b; casting it to the matching StructType fixes the analysis error.
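If you prefer not to spell out the StructType by hand, a minimal alternative sketch (assuming the case classes above are in scope) is to derive it from the case class encoder:
import org.apache.spark.sql.Encoders
// derive the struct type for B from its encoder instead of writing it manually
val bType = Encoders.product[B].schema
df.select(col("_1").as("i"), lit(null).cast(bType).as("b")).as[A].show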

Related

How to set default value to 'null' in Dataset parsed from RDD[String] applying Case Class as schema

I am parsing JSON strings from a given RDD[String] and trying to convert them into a Dataset with a given case class. However, when a JSON string does not contain all fields of the case class, I get an exception saying the missing column could not be found.
How can I define default values for such cases?
I tried defining default values in the case class, but that did not solve the problem. I am working with Spark 2.3.2 and Scala 2.11.12.
This code works fine:
import org.apache.spark.rdd.RDD
case class SchemaClass(a: String, b: String)
val jsonData: String = """{"a": "foo", "b": "bar"}"""
val jsonRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonData))
import spark.implicits._
val ds = spark.read.json(jsonRddString).as[SchemaClass]
When I run this code
val jsonDataIncomplete: String = """{"a": "foo"}"""
val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete))
import spark.implicits._
val dsIncomplete = spark.read.json(jsonIncompleteRddString).as[SchemaClass]
dsIncomplete.printSchema()
dsIncomplete.show()
I get the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`b`' given input columns: [a];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:89)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
[...]
Interestingly, the default value null is applied when the JSON strings are parsed from a file, as in the example from the Spark documentation on Datasets:
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Content of the JSON file:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
If you are using Spark 2.2+, you can skip loading the JSON as an RDD and then reading it as a DataFrame, and instead read it directly:
val dsIncomplete = spark.read.json(Seq(jsonDataIncomplete).toDS)
Then:
Load your JSON data.
Extract the schema from the case class (or define it manually).
Get the list of missing fields.
Default the missing columns to lit(null).cast(col.dataType).
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructField, StructType}

object DefaultFieldValue {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DefaultFieldValue").getOrCreate()
    import spark.implicits._

    val jsonDataIncomplete: String = """{"a": "foo"}"""
    val dsIncomplete = spark.read.json(Seq(jsonDataIncomplete).toDS)

    // expected schema derived from the case class
    val schema: StructType = Encoders.product[SchemaClass].schema
    val fields: Array[StructField] = schema.fields

    // add every case-class field that is missing from the parsed JSON as a typed null column
    val missingFields = fields.filterNot(f => dsIncomplete.columns.contains(f.name))
    val outdf = missingFields.foldLeft(dsIncomplete)((acc, field) =>
      acc.withColumn(field.name, lit(null).cast(field.dataType)))

    outdf.printSchema()
    outdf.show()
  }
}
case class SchemaClass(a: String, b: Int, c: String, d: Double)
package spark

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Column, Encoders, SparkSession}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{col, lit}

object JsonDF extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  case class SchemaClass(a: String, b: Int)

  val jsonDataIncomplete: String = """{"a": "foo", "m": "eee"}"""
  val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete))

  val dsIncomplete = spark.read.json(jsonIncompleteRddString) // .as[SchemaClass]

  // expected schema from the case class, plus the column list derived from it
  lazy val schema: StructType = Encoders.product[SchemaClass].schema
  lazy val fields: Array[String] = schema.fieldNames
  lazy val colNames: Array[Column] = fields.map(col(_))

  val sch = dsIncomplete.schema

  // fields expected by the case class but missing in the JSON: add them as typed nulls
  val schemaDiff = schema.diff(sch)
  val rr = schemaDiff.foldLeft(dsIncomplete)((acc, field) =>
    acc.withColumn(field.name, lit(null).cast(field.dataType)))

  // fields present in the JSON but not in the case class: drop them, then order the columns
  val schDiff = sch.diff(schema)
  val rrr = schDiff.foldLeft(rr)((acc, field) =>
      acc.drop(field.name))
    .select(colNames: _*)
}
This works the same way if you have a mix of different JSON strings in the same RDD; it only throws an error when the only string(s) present do not match the schema. For example:
val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete, jsonData))
import spark.implicits._
val dsIncomplete = spark.read.json(jsonIncompleteRddString).as[SchemaClass]
dsIncomplete.printSchema()
dsIncomplete.show()
scala> dsIncomplete.show()
+---+----+
| a| b|
+---+----+
|foo|null|
|foo| bar|
+---+----+
One way to do it is, instead of converting it directly with as[Person], to build the schema (StructType) from the case class and apply it while reading the JSON file:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.schema(schema).json(path).as[Person]
peopleDS.show
+-------+----+
| name| age|
+-------+----+
|Michael|null|
+-------+----+
Content of the JSON file is:
{"name":"Michael"}
The answer from @Sathiyan S led me to the following solution (presenting it here because it did not completely solve my problem but pointed me in the right direction):
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{StructField, StructType}

// create the expected schema from the case class
val schema = Encoders.product[SchemaClass].schema

// mark every field as nullable
val newSchema = StructType(schema.map {
  case StructField(c, t, _, m) => StructField(c, t, nullable = true, m)
})

// apply the expected, nullable schema when parsing the JSON strings
session.read.schema(newSchema).json(jsonIncompleteRddString).as[SchemaClass]
Benefits:
All missing fields are set to null, regardless of data type.
Additional fields in the JSON string that are not part of the case class are ignored.
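A quick check (sketch, reusing session, newSchema, and SchemaClass from above; the extra field name is made up for illustration):
import session.implicits._
// a JSON string that is missing "b" and carries an extra field not in the case class
val mixedJson = Seq("""{"a": "foo", "extra": 1}""").toDS
val parsed = session.read.schema(newSchema).json(mixedJson).as[SchemaClass]
parsed.show()  // "b" comes back as null and "extra" is dropped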

I tried to use groupBy on my DataFrame after adding a new column, but I ran into a Task Not Serializable problem

This is my code; I get the Task Not Serializable error when I call result.groupBy("value"):
object Test extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[4]")
    .appName("https://SparkByExamples.com")
    .getOrCreate()

  import spark.implicits._

  def myUDF = udf { (v: Double) =>
    if (v < 0) 100
    else 500
  }

  val central: DataFrame = Seq((1, 2014), (2, 2018)).toDF("key", "year1")
  val other1: DataFrame = Seq((1, 2016), (2, 2015)).toDF("key", "year2")

  val result = central.join(other1, Seq("key"))
    .withColumn("value", myUDF(col("year2")))

  result.show()

  val result2 = result.groupBy("value")
    .count()

  result2.show()
}
I ran the same code and did not get any Task Not Serializable error; the problem must lie somewhere else in your setup.
import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark: SparkSession = SparkSession.builder()
    .master("local[4]")
    .appName("https://SparkByExamples.com")
    .getOrCreate()

  import spark.implicits._

  def myUDF = udf { (v: Double) =>
    if (v < 0) 100
    else 500
  }

  val central: DataFrame = Seq((1, 2014), (2, 2018)).toDF("key", "year1")
  val other1: DataFrame = Seq((1, 2016), (2, 2015)).toDF("key", "year2")

  val result = central.join(other1, Seq("key"))
    .withColumn("value", myUDF(col("year2")))

  result.show()

  val result2 = result.groupBy("value")
    .count()

  result2.show()
}
Result:
+---+-----+-----+-----+
|key|year1|year2|value|
+---+-----+-----+-----+
| 1| 2014| 2016| 500|
| 2| 2018| 2015| 500|
+---+-----+-----+-----+
+-----+-----+
|value|count|
+-----+-----+
| 500| 2|
+-----+-----+
Conclusion:
This kind of situation usually arises when your Spark version is not compatible with your Scala version.
Check https://mvnrepository.com/artifact/org.apache.spark/spark-core for all Spark versions and the corresponding Scala versions you need to use.
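For example, a minimal build.sbt sketch (illustrative version numbers; pick the pair that matches your cluster), keeping the Scala binary version aligned with the Spark artifacts:
// build.sbt (sketch): the %% operator appends the Scala binary suffix,
// so scalaVersion must be one that these Spark artifacts are published for
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.5",
  "org.apache.spark" %% "spark-sql"  % "2.4.5"
)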

Error in Spark Scala IDE while reading a nested complex JSON file

I have a complex nested JSON data file, as shown below, and I am trying to consume the data and convert it according to the class:
case class DeviceData (id: Int, device: String)
where id = 0 and
device = "{""device_id"": 0, ""device_type"": ""sensor-ipad"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""68.161.225.1"", ""cca3"": ""USA"", ""cn"": ""United States"", ""temp"": 25, ""signal"": 23, ""battery_level"": 8, ""c02_level"": 917, ""timestamp"" :1475600496 }"
But I am stuck at the very first step: while consuming the data and converting it to a simple DataFrame I get a _corrupt_record error. Please advise what mistake I have made. I am using Spark version 2.4.5.
export1.json
0,"{""device_id"": 0, ""device_type"": ""sensor-ipad"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""68.161.225.1"", ""cca3"": ""USA"", ""cn"": ""United States"", ""temp"": 25, ""signal"": 23, ""battery_level"": 8, ""c02_level"": 917, ""timestamp"" :1475600496 }"
1,"{""device_id"": 1, ""device_type"": ""sensor-igauge"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""213.161.254.1"", ""cca3"": ""NOR"", ""cn"": ""Norway"", ""temp"": 30, ""signal"": 18, ""battery_level"": 6, ""c02_level"": 1413, ""timestamp"" :1475600498 }"
2,"{""device_id"": 2, ""device_type"": ""sensor-ipad"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""88.36.5.1"", ""cca3"": ""ITA"", ""cn"": ""Italy"", ""temp"": 18, ""signal"": 25, ""battery_level"": 5, ""c02_level"": 1372, ""timestamp"" :1475600500 }"
and my Spark code is as below:
package sparkWCExample.spWCExample

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.spark.sql.types._      // the Spark types, to define our schema
import org.apache.spark.sql.functions._  // the Spark helper functions

case class DeviceData(id: Int, device: String)

object DatasetExample {
  def main(args: Array[String]) {
    println("Start now")
    val conf = new SparkConf().setAppName("Spark Scala WordCount Example").setMaster("local[1]")
    val spark = SparkSession.builder().config(conf).appName("CsvExample").master("local").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._

    val readJSONDF = spark.read.json(sc.wholeTextFiles("C:\\Sankha\\Study\\data\\complex-nested-json\\export1.json").values).toDF()
    readJSONDF.show()
  }
}
Instead of the parsed data I get a _corrupt_record column:
+--------------------+
| _corrupt_record|
+--------------------+
|0,"{""device_id""...|
+--------------------+
sc.wholeTextFiles creates a PairRDD with the key being the file name and the value the content of the whole file. More details can be found here.
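To illustrate (a minimal sketch using the file path from the question):
// each element is (filePath, entireFileContent), so spark.read.json later sees
// the whole multi-line file as one (malformed) JSON record
sc.wholeTextFiles("C:\\Sankha\\Study\\data\\complex-nested-json\\export1.json")
  .take(1)
  .foreach { case (path, content) => println(s"$path -> ${content.take(60)}...") }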
You might want to use spark.read.text and then split the lines afterwards:
val df = spark.read.text("export1.json")
  .map(row => {
    val s = row.getAs[String](0)
    val index = s.indexOf(',')
    DeviceData(s.substring(0, index).toInt, s.substring(index + 1))
  })
df.show
prints
+---+--------------------+
| id| device|
+---+--------------------+
| 0|"{""device_id"": ...|
| 1|"{""device_id"": ...|
| 2|"{""device_id"": ...|
+---+--------------------+
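If you then need the nested fields as real columns, a possible follow-up sketch (assuming the doubled quotes in the device column are CSV-style escaping; the schema lists only a few of the fields from the sample data):
import org.apache.spark.sql.functions.{col, from_json, regexp_replace}
import org.apache.spark.sql.types._

// collapse the escaped "" into " and strip the outer quotes so the string becomes valid JSON
val cleaned = df.withColumn("json",
  regexp_replace(regexp_replace(col("device"), "\"\"", "\""), "^\"|\"$", ""))

// only a handful of the fields from the sample records, as an illustration
val deviceSchema = new StructType()
  .add("device_id", LongType)
  .add("device_type", StringType)
  .add("ip", StringType)
  .add("battery_level", LongType)

cleaned
  .withColumn("parsed", from_json(col("json"), deviceSchema))
  .select("id", "parsed.*")
  .show()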

create a dataset with data frame from a sequence of tuples without using a case class

I have a sequence of tuples from which I made an RDD and converted that to a DataFrame, like below.
val rdd = sc.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
import spark.implicits._
val df = rdd.toDF("Id", "firstname")
Now I want to create a Dataset from df. How can I do that?
Simply df.as[(Int, String)] is what you need to do. Please see the full example below.
package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object SeqTuplesToDataSet {

  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(this.getClass.getName).config("spark.master", "local").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val rdd = spark.sparkContext.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))

    import spark.implicits._
    val df = rdd.toDF("Id", "firstname")
    val myds: Dataset[(Int, String)] = df.as[(Int, String)]

    myds.show()
  }
}
Result:
+---+---------+
| Id|firstname|
+---+---------+
| 1| User1|
| 2| user2|
| 3| user3|
+---+---------+
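A shorter variant (sketch, still without a case class, assuming the same spark session and implicits as above): skip the RDD and the DataFrame entirely and build the Dataset straight from the sequence, renaming the tuple columns afterwards if you want the same names.
// toDS gives Dataset[(Int, String)] with default column names _1 and _2
val myds2: Dataset[(Int, String)] = Seq((1, "User1"), (2, "user2"), (3, "user3")).toDS()
// rename the tuple columns if you want the same names as before
val renamed: Dataset[(Int, String)] = myds2.toDF("Id", "firstname").as[(Int, String)]
renamed.show()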

Spark Scala CSV Column names to Lower Case

Please find the code below and let me know how I can change the column names to lower case. I tried withColumnRenamed, but then I have to call it for each column and type every column name. I just want to apply it to all columns without listing their names, as there are too many of them.
Scala Version: 2.11
Spark : 2.2
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._

object dataframeset {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    val rdd1 = sc.cassandraTable("tdata", "map3")

    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val spark1 = org.apache.spark.sql.SparkSession.builder().master("local").config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example").getOrCreate()

    val df = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load("/Users/Desktop/del2.csv")
    import spark1.implicits._

    println("\nTop Records are:")
    df.show(1)

    val dfprev1 = df.select(col = "sno", "year", "StateAbbr")
    dfprev1.show(1)
  }
}
Required output:
|sno|year|stateabbr| statedesc|cityname|geographiclevel
All the column names should be in lower case.
Actual output:
Top Records are:
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
|sno|year|StateAbbr| StateDesc|CityName|GeographicLevel|DataSource| category|UniqueID| Measure|Data_Value_Unit|DataValueTypeID| Data_Value_Type|Data_Value|Low_Confidence_Limit|High_Confidence_Limit|Data_Value_Footnote_Symbol|Data_Value_Footnote|PopulationCount|GeoLocation|categoryID|MeasureId|cityFIPS|TractFIPS|Short_Question_Text|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
| 1|2014| US|United States| null| US| BRFSS|Prevention| 59|Current lack of h...| %| AgeAdjPrv|Age-adjusted prev...| 14.9| 14.6| 15.2| null| null| 308745538| null| PREVENT| ACCESS2| null| null| Health Insurance|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
only showing top 1 row
+---+----+---------+
|sno|year|StateAbbr|
+---+----+---------+
| 1|2014| US|
+---+----+---------+
only showing top 1 row
Just use toDF:
df.toDF(df.columns map(_.toLowerCase): _*)
Another way to achieve it is using the foldLeft method:
val myDFcolNames = myDF.columns.toList
val rdoDenormDF = myDFcolNames.foldLeft(myDF)((accDF, c) =>
  accDF.withColumnRenamed(c, c.toLowerCase))
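A third equivalent option (sketch) is a single select with lower-cased aliases, which avoids renaming columns one at a time:
import org.apache.spark.sql.functions.col
// build one alias per existing column and select them all at once
val dfLower = df.select(df.columns.map(c => col(c).as(c.toLowerCase)): _*)
dfLower.show(1)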