spark convert dataframe to dataset using case class with option fields - scala

I have the following case class:
case class Person(name: String, lastname: Option[String] = None, age: BigInt) {}
And the following json:
{ "name": "bemjamin", "age" : 1 }
When I try to transform my dataframe into a dataset:
spark.read.json("example.json")
.as[Person].show()
It shows me the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'lastname' given input columns: [age, name];
My question is: if my schema is my case class and it declares that lastname is optional, shouldn't as() do the conversion?
I can easily fix this with a .map, but I would like to know whether there is a cleaner alternative.
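For reference, here is the kind of .map workaround I mean - a rough sketch, assuming spark.implicits._ is in scope and that the inferred age column comes back as a Long:
import spark.implicits._

// Read with the inferred schema (no lastname column), then build Person
// values by hand, defaulting the missing field to None.
val people = spark.read.json("example.json")
  .map { row =>
    Person(
      name = row.getAs[String]("name"),
      lastname = None, // column absent in this file
      age = BigInt(row.getAs[Long]("age"))
    )
  }
people.show()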

There is one more option to solve the above issue. Two steps are required:
1. Make sure that fields that can be missing are declared as nullable Scala types (like Option[_]).
2. Provide a schema argument instead of depending on schema inference. You can, for example, use the Spark SQL Encoder:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
You can then update the code as below:
val schema = Encoders.product[Person].schema
val df = spark.read
  .schema(schema)
  .json("/Users/../Desktop/example.json")
  .as[Person]
df.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+

When you perform spark.read.json("example.json").as[Person].show(), Spark reads the dataframe as
FileScan json [age#6L,name#7]
and then tries to apply the encoder for the Person object, hence the AnalysisException: it cannot find lastname in your json file.
You could either hint to Spark that lastname is optional by supplying some data that has a lastname, or try this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema: StructType = ScalaReflection.schemaFor[Person].dataType.asInstanceOf[StructType]
val x = spark.read
  .schema(schema)
  .json("src/main/resources/json/x.json")
  .as[Person]
x.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+
Hope it helps.

Related

How best to handle schema conflicts converting MongoRDD to DataFrame?

I'm trying to read some documents from a mongo database and parse the schema in a spark DataFrame. So far I have had success reading from mongo and transforming the resulting mongoRDD into a DataFrame using a schema defined by case classes, but there's a scenario where the mongo collection has a field containing multiple datatypes (array of strings vs. array of nested objects).
So far I have been simply parsing the field as a string, then using spark sql's from_json() to parse the nested objects in the new schema, but I am finding that when a field does not conform to the schema, it returns null for all fields in the schema - not simply the field that does not conform. Is there a way to parse this so that only fields not matching the schema will return null?
//creating mongo test data in mongo shell
db.createCollection("testColl")
db.testColl.insertMany([
{ "foo" : ["fooString1", "fooString2"], "bar" : "barString"},
{ "foo" : [{"uid" : "fooString1"}, {"uid" : "fooString2"}], "bar" : "barString"}
])
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.functions._
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.types.{StringType, StructField, StructType}
//mongo connector and read config
val testConfig = ReadConfig(Map("uri" -> "mongodb://some.mongo.db",
"database" -> "testDB",
"collection" -> "testColl"
))
//Option 1: 'lowest common denominator' case class - works, but leaves the nested struct type value as json that then needs additional parsing
case class stringArray (foo: Option[Seq[String]], bar: Option[String])
val df1 : DataFrame = MongoSpark.load(spark.sparkContext, testConfig).toDF[stringArray]
df1.show()
+--------------------+---------+
| foo| bar|
+--------------------+---------+
|[fooString1, fooS...|barString|
|[{ "uid" : "fooSt...|barString|
+--------------------+---------+
//Option 2: accurate case class - fails with:
//com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(uid,StringType,true)) (value: BsonString{value='fooString1'})
case class fooDoc (uid: Option[String])
case class docArray (foo: Option[Seq[fooDoc]], bar: Option[String])
val df2 : DataFrame = MongoSpark.load(spark.sparkContext, testConfig).toDF[docArray]
//Option 3: map all rows to a json string, then use from_json - why does this return null for 'bar' on the row that doesn't fit the schema?
val mrdd = MongoSpark.load(spark.sparkContext, testConfig)
val jsonRDD = mrdd.map(x => Row(x.toJson()))
val simpleSchema = StructType(Seq(StructField("wholeRecordJson", StringType, true)))
val schema = ScalaReflection.schemaFor[docArray].dataType.asInstanceOf[StructType]
val jsonDF = spark.createDataFrame(jsonRDD, simpleSchema)
val df3 = jsonDF.withColumn("parsed",from_json($"wholeRecordJson", schema))
df3.select("parsed.foo", "parsed.bar").show()
+--------------------+---------+
| foo| bar|
+--------------------+---------+
| null| null|
|[[fooString1], [f...|barString|
+--------------------+---------+
//Desired results:
//desired outcome is for only the field not matching the schema (string type of 'foo') is null, but matching columns are populated
+--------------------+---------+
| foo| bar|
+--------------------+---------+
| null|barString|
|[[fooString1], [f...|barString|
+--------------------+---------+
No, there is no easy way to do this, as having merge-incompatible schemas in the same document collection is an anti-pattern, even in Mongo.
There are three main approaches to deal with this:
Fix the data in MongoDB.
Issue a query that "normalizes" the Mongo schema, e.g., drops fields with incompatible types or converts them or renames them, etc.
Issue separate queries to Mongo for documents of a particular schema type. (Mongo has query operators that can filter based on the type of a field.) Then post-process in Spark and, finally, union the data into a single Spark dataset, as sketched below.
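For the third approach, a rough sketch using the connector's aggregation-pipeline support. This assumes withPipeline is available on the loaded MongoRDD and that the BSON type of the first element of foo is enough to tell the two document shapes apart; it reuses the testConfig, stringArray, docArray and fooDoc definitions from the question.
import com.mongodb.spark.MongoSpark
import org.bson.Document
import org.apache.spark.sql.functions.{col, lit}

// Documents whose foo elements are sub-documents: safe to load as docArray.
val objectDocs = MongoSpark.load(spark.sparkContext, testConfig)
  .withPipeline(Seq(Document.parse("""{ "$match": { "foo.0": { "$type": "object" } } }""")))
  .toDF[docArray]
  .select("foo", "bar")

// Documents whose foo elements are plain strings: load as stringArray and
// null out the incompatible column so both sides share one schema.
val stringDocs = MongoSpark.load(spark.sparkContext, testConfig)
  .withPipeline(Seq(Document.parse("""{ "$match": { "foo.0": { "$type": "string" } } }""")))
  .toDF[stringArray]
  .select(lit(null).cast("array<struct<uid:string>>").as("foo"), col("bar"))

// Union back into a single dataset with the desired shape.
objectDocs.union(stringDocs).show()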

spark Scala RDD to DataFrame Date format

Would you be able to help with this Spark problem statement?
Data -
empno|ename|designation|manager|hire_date|sal|deptno
7369|SMITH|CLERK|9902|2010-12-17|800.00|20
7499|ALLEN|SALESMAN|9698|2011-02-20|1600.00|30
Code:
val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val refinedRDD = rawrdd.map( lines => {
  val fields = lines.split("\\|")
  (fields(0).toInt, fields(1), fields(2), fields(3).toInt, fields(4).toDate, fields(5).toFloat, fields(6).toInt)
})
Problem statement - this is not working: fields(4).toDate. What is the alternative, or what is the correct usage?
What have I tried?
1. Tried replacing it with to_date(col(fields(4)), "yyy-MM-dd") - not working.
2.
Step 1.
val refinedRDD = rawrdd.map( lines => {
val fields = lines.split("\\|")
(fields(0),fields(1),fields(2),fields(3),fields(4),fields(5),fields(6))
})
Now the fields in these tuples are all strings.
Step 2.
val mySchema = StructType(Seq(StructField("empno", IntegerType, true), StructField("ename", StringType, true),
  StructField("designation", StringType, true), StructField("manager", IntegerType, true), StructField("hire_date", DateType, true),
  StructField("sal", DoubleType, true), StructField("deptno", IntegerType, true)))
Step 3. converting the string tuples to Rows
val rowRDD = refinedRDD.map(attributes => Row(attributes._1, attributes._2, attributes._3, attributes._4, attributes._5 , attributes._6, attributes._7))
Step 4.
val empDF = spark.createDataFrame(rowRDD, mySchema)
This is also not working and gives an error related to types. To solve this I changed step 1 to
(fields(0).toInt,fields(1),fields(2),fields(3).toInt,fields(4),fields(5).toFloat,fields(6).toInt)
Now this gives an error for the date type column and I am back at the main problem.
Use case - use the textFile API and convert this to a dataframe using a custom schema (StructType) on top of it.
This can be done using a case class, but with a case class I would also be stuck needing fields(4).toDate (I know I can cast the string to a date later in the code, but I would prefer a solution to the above problem if possible).
You can use the following code snippet
import org.apache.spark.sql.functions.to_timestamp
scala> val df = spark.read.format("csv").option("header", "true").option("delimiter", "|").load("gs://otif-etl-input/test.csv")
df: org.apache.spark.sql.DataFrame = [empno: string, ename: string ... 5 more fields]
scala> val ts = to_timestamp($"hire_date", "yyyy-MM-dd")
ts: org.apache.spark.sql.Column = to_timestamp(`hire_date`, 'yyyy-MM-dd')
scala> val enriched_df = df.withColumn("ts", ts)
scala> enriched_df.show(2, false)
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|empno|ename|designation|manager|hire_date |sal |deptno |ts |
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
|7369 |SMITH|CLERK |9902 |2010-12-17|800.00 |20 |2010-12-17 00:00:00|
|7499 |ALLEN|SALESMAN |9698 |2011-02-20|1600.00|30 |2011-02-20 00:00:00|
+-----+-----+-----------+-------+----------+-------+----------+-------------------+
There are multiple ways to cast your data to proper data types.
First: use inferSchema
val df = spark.read.option("delimiter", "|").option("header", true).option("inferSchema", "true").csv(path)
df.printSchema
Sometimes it doesn't work as expected; see details here.
Second: provide your own datatype conversion template
import org.apache.spark.sql.functions.col
val rawDF = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
//define schema in DF , hire_date as Date
val schemaDF = Seq(("empno", "INT"), ("ename", "STRING"), ("hire_date", "date"), ("sal", "double")).toDF("columnName", "columnType")
rawDF.printSchema
//fetch schema details
val dataTypes = schemaDF.select("columnName", "columnType")
val listOfElements = dataTypes.collect.map(_.toSeq.toList)
//creating a map friendly template
val validationTemplate = (c: Any, t: Any) => {
val column = c.asInstanceOf[String]
val typ = t.asInstanceOf[String]
col(column).cast(typ)
}
//Apply datatype conversion template on rawDF
val convertedDF = rawDF.select(listOfElements.map(element => validationTemplate(element(0), element(1))): _*)
println("Conversion done!")
convertedDF.show()
convertedDF.printSchema
Third: case class
Create the schema from a case class with ScalaReflection and provide this customized schema while loading the DF.
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types._
import java.sql.Date

case class MySchema(empno: Int, ename: String, hire_date: Date, sal: Double)

val schema = ScalaReflection.schemaFor[MySchema].dataType.asInstanceOf[StructType]
val rawDF = spark.read.schema(schema).option("header", "true").option("delimiter", "|").csv(path)
rawDF.printSchema
Hope this will help.
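If you want to stay with the textFile/RDD approach from the question, here is a minimal sketch that parses hire_date inside the map with java.sql.Date.valueOf. It assumes the dates really are always in yyyy-MM-dd form, and it restates the schema from step 2 of the question.
import java.sql.Date
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val mySchema = StructType(Seq(
  StructField("empno", IntegerType, true), StructField("ename", StringType, true),
  StructField("designation", StringType, true), StructField("manager", IntegerType, true),
  StructField("hire_date", DateType, true), StructField("sal", DoubleType, true),
  StructField("deptno", IntegerType, true)))

val rawrdd = spark.sparkContext.textFile("C:\\Users\\cmohamma\\data\\delta scenarios\\emp_20191010.txt")
val rowRDD = rawrdd
  .filter(!_.startsWith("empno"))   // drop the header line
  .map { line =>
    val f = line.split("\\|")
    // Date.valueOf parses yyyy-MM-dd strings into java.sql.Date, which matches DateType.
    Row(f(0).toInt, f(1), f(2), f(3).toInt, Date.valueOf(f(4)), f(5).toDouble, f(6).toInt)
  }
val empDF = spark.createDataFrame(rowRDD, mySchema)
empDF.printSchema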

Pass Spark SQL function name as parameter in Scala

I am trying to pass a Spark SQL function name to my defined function in Scala.
I am trying to get the same functionality as:
myDf.agg(max($"myColumn"))
my attempt doesn't work:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(myParameter($"myColumn"))
}
Obviously it shouldn't work, as I'm providing a string type; I am unable to find a way to make it work.
Is it even possible?
Edit:
I have to provide the SQL function name (and it can be another aggregate function) as a parameter when calling my function.
myFunc(anyDf, max) or myFunc(anyDf, "max")
agg also takes a Map[String, String], which allows you to do what you want:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(Map("myColumn" -> myParameter))
}
example:
val df = Seq(1.0,2.0,3.0).toDF("myColumn")
myFunc(df,"avg")
.show()
gives:
+-------------+
|avg(myColumn)|
+-------------+
| 2.0|
+-------------+
Try this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, max}

val df = Seq((1, 2, 12), (2, 1, 21), (1, 5, 10), (5, 3, 9), (2, 5, 4)).toDF("a", "b", "c")

def myFunc(df: DataFrame, f: Column): DataFrame = {
  df.agg(f)
}

myFunc(df, max(col("a"))).show
+------+
|max(a)|
+------+
| 5|
+------+
Hope it helps!

How to pass dataset column value to a function while using spark filter with scala?

I have action data which consists of a user id and an action type:
+-------+-------+
|user_id|   type|
+-------+-------+
|     11| SEARCH|
|     11| DETAIL|
|     12| SEARCH|
+-------+-------+
I want to filter the actions that belong to users who have at least one SEARCH action.
So I created a bloom filter with the user ids that have a SEARCH action.
Then I tried to filter all actions based on the bloom filter's view of each user:
import org.apache.spark.util.sketch.BloomFilter

val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
df.filter(bloomFilter.mightContainString($"user_id"))
But the code gives an exception
type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how I can pass the column value to the BloomFilter.mightContainString method.
Create filter:
val expectedNumItems: Long = ???
val fpp: Double = ???
val f = df.stat.bloomFilter("user_id", expectedNumItems, fpp)
Use udf for filtering:
import org.apache.spark.sql.functions.udf
val mightContain = udf((s: String) => f.mightContain(s))
df.filter(mightContain($"user_id"))
If your current Bloom filter implementation is serializable, you should be able to use it the same way, but if the data is large enough to justify a Bloom filter, you should avoid collecting it.
You can do something like this,
val sparkSession = ???
val sc = sparkSession.sparkContext
val bloomFilter = BloomFilter.create(100)
val df = ???
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
searchers.foreach(bloomFilter.putString(_))
At this point, I'll mention that collect is not a good idea. Next you can do something like:
import org.apache.spark.sql.functions.udf
val bbFilter = sc.broadcast(bloomFilter)
val filterUDF = udf((s: String) => bbFilter.value.mightContainString(s))
df.filter(filterUDF($"user_id"))
You can remove the broadcasting if the bloomFilter instance is serializable.
Hope this helps, Cheers.

Cannot access Spark dataframe methods

In Zeppelin I am using a dataframe created in another paragraph. I display the type of my df variable and get:
res35: String = DataFrame
suggesting it is a dataframe. But when I try and use select on the df variable I get an error:
<console>:62: error: value select is not a member of Object
Do I have to convert the Object to a DataFrame or something? Can someone tell me what I am missing? TIA!
My code is:
val df = z.get("wds")
df.getClass.getSimpleName
df.select(explode($"filtered").as("value")).groupBy("value").count.show
This gives the following (edited) output:
df: Object = [racist: boolean, contributors:
string, coordinates: string, ...n: Int = 20
res35: String = DataFrame
<console>:62: error: value select is not a member of Object
df.select(explode($"filtered").as("value")).groupBy("value").count.show
It seems I was missing
.asInstanceOf[DataFrame]
i.e.
import org.apache.spark.sql.DataFrame
val df = z.get("wds").asInstanceOf[DataFrame]