I have a complex nested JSON data file (shown below) and I am trying to consume the data and convert it to the following case class
case class DeviceData (id: Int, device: String)
where id = 0 and
device = "{""device_id"": 0, ""device_type"": ""sensor-ipad"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""68.161.225.1"", ""cca3"": ""USA"", ""cn"": ""United States"", ""temp"": 25, ""signal"": 23, ""battery_level"": 8, ""c02_level"": 917, ""timestamp"" :1475600496 }"
But I am stuck at the very first step: when I read the data and convert it to a simple DataFrame, I get a _corrupt_record column. Please advise what mistake I have made. I am using Spark version 2.4.5.
export1.json
0,"{""device_id"": 0, ""device_type"": ""sensor-ipad"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""68.161.225.1"", ""cca3"": ""USA"", ""cn"": ""United States"", ""temp"": 25, ""signal"": 23, ""battery_level"": 8, ""c02_level"": 917, ""timestamp"" :1475600496 }"
1,"{""device_id"": 1, ""device_type"": ""sensor-igauge"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""213.161.254.1"", ""cca3"": ""NOR"", ""cn"": ""Norway"", ""temp"": 30, ""signal"": 18, ""battery_level"": 6, ""c02_level"": 1413, ""timestamp"" :1475600498 }"
2,"{""device_id"": 2, ""device_type"": ""sensor-ipad"",""battery"":[{""type"": ""electrical""} ,{""type"": ""solar""}], ""ip"": ""88.36.5.1"", ""cca3"": ""ITA"", ""cn"": ""Italy"", ""temp"": 18, ""signal"": 25, ""battery_level"": 5, ""c02_level"": 1372, ""timestamp"" :1475600500 }"
My Spark code is below:
package sparkWCExample.spWCExample

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types._      // Spark types, to define our schema
import org.apache.spark.sql.functions._  // Spark helper functions
import org.apache.spark.sql.functions.to_timestamp
case class DeviceData(id: Int, device: String)

object DatasetExample {
  def main(args: Array[String]) {
    println("Start now")
    val conf = new SparkConf().setAppName("Spark Scala WordCount Example").setMaster("local[1]")
    val spark = SparkSession.builder().config(conf).appName("CsvExample").master("local").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._

    val readJSONDF = spark.read.json(sc.wholeTextFiles("C:\\Sankha\\Study\\data\\complex-nested-json\\export1.json").values).toDF()
    readJSONDF.show()
  }
}
Instead of the expected data I am getting a _corrupt_record column:
+--------------------+
| _corrupt_record|
+--------------------+
|0,"{""device_id""...|
+--------------------+
sc.wholeTextFiles creates a PairRDD with the key being the file name and the value the content of the whole file. More details can be found in the Spark documentation.
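To see what spark.read.json is actually being fed here (a quick sketch, not part of the original answer; it assumes the same file path as in the question):

// wholeTextFiles yields (fileName, wholeFileContent) pairs, so .values is one
// multi-line string per file -- and each line of that string is "<id>,<json>",
// which is not valid line-delimited JSON.
sc.wholeTextFiles("C:\\Sankha\\Study\\data\\complex-nested-json\\export1.json")
  .take(1)
  .foreach { case (file, content) => println(s"$file -> ${content.take(100)} ...") }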
You might want to use spark.read.text and then split the lines afterwards:
val df = spark.read.text("export1.json")
  .map(row => {
    val s = row.getAs[String](0)
    val index = s.indexOf(',')
    DeviceData(s.substring(0, index).toInt, s.substring(index + 1))
  })

df.show
prints
+---+--------------------+
| id| device|
+---+--------------------+
| 0|"{""device_id"": ...|
| 1|"{""device_id"": ...|
| 2|"{""device_id"": ...|
+---+--------------------+
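From here, if the goal is to query the nested device fields, one possible next step (a sketch, not part of the original answer; it assumes Spark's from_json and a hand-written partial schema) is to unescape the CSV-style quoting and parse the device string:

import org.apache.spark.sql.functions.{col, from_json, regexp_replace}
import org.apache.spark.sql.types._

// The device column still carries CSV-style quoting: a surrounding " and "" for
// embedded quotes. Strip that before handing the string to from_json.
val unescaped = df.withColumn("device",
  regexp_replace(regexp_replace(col("device"), "^\"|\"$", ""), "\"\"", "\""))

// A partial schema is enough: from_json only extracts the listed fields.
val deviceSchema = StructType(Seq(
  StructField("device_id", LongType),
  StructField("device_type", StringType),
  StructField("cn", StringType),
  StructField("temp", LongType)
))

unescaped
  .withColumn("device", from_json(col("device"), deviceSchema))
  .select(col("id"), col("device.device_type"), col("device.cn"))
  .show()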
I am parsing JSON strings from a given RDD[String] and trying to convert them into a Dataset with a given case class. However, when a JSON string does not contain all required fields of the case class, I get an exception saying that the missing column could not be resolved.
How can I define default values for such cases?
I tried defining default values in the case class but that did not solve the problem. I am working with Spark 2.3.2 and Scala 2.11.12.
This code works fine:
import org.apache.spark.rdd.RDD
case class SchemaClass(a: String, b: String)
val jsonData: String = """{"a": "foo", "b": "bar"}"""
val jsonRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonData))
import spark.implicits._
val ds = spark.read.json(jsonRddString).as[SchemaClass]
When I run this code
val jsonDataIncomplete: String = """{"a": "foo"}"""
val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete))
import spark.implicits._
val dsIncomplete = spark.read.json(jsonIncompleteRddString).as[SchemaClass]
dsIncomplete.printSchema()
dsIncomplete.show()
I get the following Exception
org.apache.spark.sql.AnalysisException: cannot resolve '`b`' given input columns: [a];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:92)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:89)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$11.apply(TreeNode.scala:335)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
[...]
Interestingly, the default value null is applied when the JSON strings are parsed from a file, as shown in the example from the Spark documentation on Datasets:
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Content of the JSON file:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
If you are using Spark 2.2+, you can skip loading the JSON as an RDD and then reading it as a DataFrame, and instead read it directly:
val dsIncomplete = spark.read.json(Seq(jsonDataIncomplete).toDS)
Then, to fill in the missing fields:
1. Load your JSON data.
2. Extract your schema from the case class, or define it manually.
3. Get the list of missing fields.
4. Default each missing column to lit(null).cast(col.dataType).
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructField, StructType}

object DefaultFieldValue {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("DefaultFieldValue").getOrCreate()
    import spark.implicits._

    val jsonDataIncomplete: String = """{"a": "foo"}"""
    val dsIncomplete = spark.read.json(Seq(jsonDataIncomplete).toDS)

    val schema: StructType = Encoders.product[SchemaClass].schema
    val fields: Array[StructField] = schema.fields

    // add only the fields that are actually missing from the parsed DataFrame
    val missingFields = fields.filterNot(f => dsIncomplete.columns.contains(f.name))
    val outdf = missingFields.foldLeft(dsIncomplete)((acc, field) => {
      acc.withColumn(field.name, lit(null).cast(field.dataType))
    })

    outdf.printSchema()
    outdf.show()
  }
}

case class SchemaClass(a: String, b: Int, c: String, d: Double)
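For the incomplete record above, this sketch should print roughly the following (every field that was absent from the JSON comes back as null):

root
 |-- a: string (nullable = true)
 |-- b: integer (nullable = true)
 |-- c: string (nullable = true)
 |-- d: double (nullable = true)

+---+----+----+----+
|  a|   b|   c|   d|
+---+----+----+----+
|foo|null|null|null|
+---+----+----+----+

A fuller variant of the same idea, which also drops any extra fields that are not part of the case class and reorders the columns: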
package spark

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Column, Encoders, SparkSession}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{col, lit}

object JsonDF extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  case class SchemaClass(a: String, b: Int)

  val jsonDataIncomplete: String = """{"a": "foo", "m": "eee"}"""
  val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete))

  val dsIncomplete = spark.read.json(jsonIncompleteRddString) // .as[SchemaClass]

  lazy val schema: StructType = Encoders.product[SchemaClass].schema
  lazy val fields: Array[String] = schema.fieldNames
  lazy val colNames: Array[Column] = fields.map(col(_))

  val sch = dsIncomplete.schema

  // add the case-class fields that are missing from the parsed schema
  val schemaDiff = schema.diff(sch)
  val rr = schemaDiff.foldLeft(dsIncomplete)((acc, field) => {
    acc.withColumn(field.name, lit(null).cast(field.dataType))
  })

  // drop the extra fields that are not part of the case class, then reorder
  val schF = dsIncomplete.schema
  val schDiff = schF.diff(schema)
  val rrr = schDiff.foldLeft(rr)((acc, field) => {
    acc.drop(field.name)
  })
    .select(colNames: _*)
}
The same approach works if there are several different JSON strings in the same RDD: the inferred schema is the union of their fields, so as[SchemaClass] succeeds. The error is only thrown when there is a single string that does not match the schema. For example:
val jsonIncompleteRddString: RDD[String] = spark.sparkContext.parallelize(List(jsonDataIncomplete, jsonData))
import spark.implicits._
val dsIncomplete = spark.read.json(jsonIncompleteRddString).as[SchemaClass]
dsIncomplete.printSchema()
dsIncomplete.show()
scala> dsIncomplete.show()
+---+----+
| a| b|
+---+----+
|foo|null|
|foo| bar|
+---+----+
One way to do this: instead of converting with as[Person] directly, build the schema (StructType) from the case class and apply it while reading the JSON file:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.schema(schema).json(path).as[Person]
peopleDS.show
+-------+----+
| name| age|
+-------+----+
|Michael|null|
+-------+----+
Content of the JSON file is:
{"name":"Michael"}
The answer from #Sathiyan S led me to the following solution (presenting it here as it did not completely solve my problem, but served as a pointer in the right direction):
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{StructField, StructType}
// create the expected schema
val schema = Encoders.product[SchemaClass].schema

// convert all fields to nullable
val newSchema = StructType(schema.map {
  case StructField(c, t, _, m) => StructField(c, t, nullable = true, m)
})

// apply the expected, nullable schema when parsing the JSON strings
session.read.schema(newSchema).json(jsonIncompleteRddString).as[SchemaClass]
Benefits:
All missing fields are set to null, independent of the data type.
Additional fields in the JSON string that are not part of the case class are ignored.
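Wrapped up as a small helper (a sketch only, not from the original answer; the name readJsonWithDefaults is made up here), the pattern looks like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

// Parse JSON strings into a Dataset[T]; any field of T missing from the input becomes null.
def readJsonWithDefaults[T : Encoder](spark: SparkSession, rdd: RDD[String]): Dataset[T] = {
  val schema = implicitly[Encoder[T]].schema
  // mark every field nullable so missing fields do not fail analysis
  val nullableSchema = StructType(schema.map {
    case StructField(name, dataType, _, metadata) => StructField(name, dataType, nullable = true, metadata)
  })
  spark.read.schema(nullableSchema).json(rdd).as[T]
}

// usage, with the case class and RDD from the question:
// val ds = readJsonWithDefaults[SchemaClass](spark, jsonIncompleteRddString)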
I have a sequence of tuples from which I made an RDD and converted that to a DataFrame, like below.
val rdd = sc.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
import spark.implicits._
val df = rdd.toDF("Id", "firstname")
Now I want to create a Dataset from df. How can I do that?
Simply df.as[(Int, String)] is what you need to do. Please see the full example below.
package com.examples
import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}
object SeqTuplesToDataSet {
org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName(this.getClass.getName).config("spark.master", "local").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val rdd = spark.sparkContext.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
import spark.implicits._
val df = rdd.toDF("Id", "firstname")
val myds: Dataset[(Int, String)] = df.as[(Int, String)]
myds.show()
}
}
Result :
+---+---------+
| Id|firstname|
+---+---------+
| 1| User1|
| 2| user2|
| 3| user3|
+---+---------+
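If you would rather have a named case class than a tuple, the same as conversion works (a sketch, using a hypothetical User class whose field names match the column names):

// define this outside main (e.g. next to SeqTuplesToDataSet) so the encoder can be derived
case class User(Id: Int, firstname: String)

val userDs: Dataset[User] = df.as[User]
userDs.show()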
I have a dataset that I am trying to flatten using Scala.
+---------+-----------+--------+
|visitorId|trackingIds|emailIds|
+---------+-----------+--------+
| a | 666b| 12|
| 7 | c0b5| 45|
| 7 | c0b4| 87|
| a | 666b,7p88| |
+---------+-----------+--------+
I am trying to produce a DataFrame that is grouped by visitorId. Below is the format:
+---------+---------------------+--------+
|visitorId| trackingIds |emailIds|
+---------+---------------------+--------+
| a | 666b,666b,7p88| 12,87|
| 7 | c0b4,c0b5 | 45|
+---------+---------------------+--------+
My code:
object flatten_data{
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.master("local[5]")
.appName("Flatten_DF")
.enableHiveSupport()
.getOrCreate()
val df = spark.read.format("csv")
.option("header","true")
.option("delimiter",",")
.load("/home/cloudera/Desktop/data.txt")
print(df.show())
val flattened = df.groupBy("visitorID").agg(collect_list("trackingIds"))
}
}
I am using IntelliJ IDEA and I am getting an error at collect_list.
I read through many solutions on Stack Overflow where people have asked how to flatten and group by key and have used the same collect_list. I am not sure why this is not working for me. Is it because of IntelliJ?
I reworked your code and this seems to work:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object flatten_data{
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val someDF = Seq(
("a", "666b",12),
("7", "c0b5",45),
("7", "666b,7p88",10)
).toDF("visitorId","trackingIds","emailIds")
someDF.groupBy("visitorID").agg(collect_list("trackingIds")).show()
}
}
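To get all the way to the expected output in the question (comma-joined values per visitor), one possible extension of the someDF example above (a sketch, not part of the original answer; flatten requires Spark 2.4+) is:

import org.apache.spark.sql.functions.{col, collect_list, concat_ws, flatten, split}

someDF
  .groupBy("visitorId")
  .agg(
    // split multi-valued cells, collect them per visitor, then flatten and re-join
    concat_ws(",", flatten(collect_list(split(col("trackingIds"), ",")))).as("trackingIds"),
    concat_ws(",", collect_list(col("emailIds").cast("string"))).as("emailIds")
  )
  .show()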
collect_list is a method defined in the org.apache.spark.sql.functions object, so you need to import it:
import org.apache.spark.sql.functions.collect_list
Alternatively, you can import the entire object, then you'll be able to use other functions from there as well:
import org.apache.spark.sql.functions._
Finally, the approach I personally prefer is to import functions as f, and use qualified calls:
import org.apache.spark.sql.{functions => f}
agg(f.collect_list(...))
This way, the global namespace inside the file is not polluted by the entire host of functions defined in functions.
I'm new to Scala programming. I have a use case where I need to retrieve a column value into a variable based on another column value in a DataFrame. This is in Scala.
I have the following data frame:
+----+---+--------+
|name|age|location|
+----+---+--------+
| XXX| 34|   India|
| YYY| 42|   China|
| ZZZ| 36| America|
+----+---+--------+
I need to get the value of the location column into a variable, based on the name that is passed in, i.e. if the passed-in name is 'xxx' I need the value 'India' from the data frame in a variable.
If I understand correctly what you mean, it is just a matter of filtering and selecting the corresponding value of location.
The following code is an example:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.DataTypes._
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.functions.col
import org.scalatest.FunSuite
class FilterTest extends FunSuite {
test("filter test") {
val spark = SparkSession.builder()
.master("local")
.appName("filter test")
.getOrCreate()
val schema = StructType(
Seq(
StructField("name", StringType, true),
StructField("age", IntegerType, true),
StructField("location", StringType, true)
)
)
val data = Seq(
Row("XXX", 34, "India"),
Row("YYY", 42, "China"),
Row("ZZZ", 36, "America")
)
val dataset = spark.createDataset(data)(RowEncoder(schema))
val value = dataset.filter(col("name") === "XXX").first().getAs[String]("location")
assert(value == "India")
}
}
This assumes the value that is passed in is unique in the DataFrame; otherwise multiple rows will be returned and you will have to handle that differently. Here is how you can solve it:
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(("XXX",34, "India"), ("YYY", 42, "China"), ("ZZZ", 36, "America")).toDF("name", "age", "location")
scala> df.show()
+----+---+--------+
|name|age|location|
+----+---+--------+
| XXX| 34| India|
| YYY| 42| China|
| ZZZ| 36| America|
+----+---+--------+
scala> val input = "XXX"
input: String = XXX
scala> val location = df.filter(s"name = '$input'").select("location").collect()(0).getString(0)
location: String = India
Hopefully that will solve your requirement....
You can use filter to get the row where the name column value is XXX. Once you have the row you can display any column of that row.
var filteredRows = dataFrame.filter(row => {
row.get(0).equals("XXX")
})
filteredRows.rdd.first().get(2)
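As a variant (a sketch, not from the answers above, reusing the df from the spark-shell example), you can stay in the DataFrame API and handle the no-match case with an Option:

import org.apache.spark.sql.functions.col

// None when no row matches, Some(location) otherwise
val maybeLocation: Option[String] =
  df.filter(col("name") === "XXX")
    .select("location")
    .collect()
    .headOption
    .map(_.getString(0))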
So this is what I've been trying:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
val conf =
new SparkConf().setMaster("local[*]").setAppName("test")
.set("spark.ui.enabled", "false").set("spark.app.id", "testApp")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
case class B(s: String)
case class A(i: Int, b: Option[B])
val df = Seq(1,2,3).map(Tuple1.apply).toDF
// lit with a struct works fine
df.select(col("_1").as("i"), struct(lit("myString").as("s")).as("b")).as[A].show
/*
+---+-----------------+
| i| b|
+---+-----------------+
| 1|Some(B(myString))|
| 2|Some(B(myString))|
| 3|Some(B(myString))|
+---+-----------------+
*/
// lit with a null throws an exception
df.select(col("_1").as("i"), lit(null).as("b")).as[A].show
/*
org.apache.spark.sql.AnalysisException: Can't extract value from b#16;
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:475)
*/
Use correct types:
import org.apache.spark.sql.types._
val s = StructType(Seq(StructField("s", StringType)))
df.select(col("_1").as("i"), lit(null).cast(s).alias("b")).as[A].show
lit(null) alone is represented as NullType, so it won't match the expected type.
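As a variation (a sketch, assuming the same df and case classes as above), the struct type can also be derived from the case class instead of being spelled out by hand:

import org.apache.spark.sql.Encoders

// derive the StructType for B from the case class itself
val bSchema = Encoders.product[B].schema

df.select(col("_1").as("i"), lit(null).cast(bSchema).alias("b")).as[A].show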