How best to handle schema conflicts converting MongoRDD to DataFrame?

I'm trying to read some documents from a Mongo database and parse them into a Spark DataFrame. So far I have had success reading from Mongo and transforming the resulting MongoRDD into a DataFrame using a schema defined by case classes, but there is a scenario where the Mongo collection has a field containing multiple datatypes (an array of strings vs. an array of nested objects).

So far I have been simply parsing the field as a string and then using Spark SQL's from_json() to parse the nested objects with the new schema, but I am finding that when a record does not conform to the schema, from_json() returns null for all fields in the schema, not just the field that does not conform. Is there a way to parse this so that only the fields not matching the schema return null?
//creating mongo test data in mongo shell
db.createCollection("testColl")
db.testColl.insertMany([
  { "foo" : ["fooString1", "fooString2"], "bar" : "barString" },
  { "foo" : [{ "uid" : "fooString1" }, { "uid" : "fooString2" }], "bar" : "barString" }
])
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.functions._
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.types.{StringType, StructField, StructType}
//mongo connector and read config
val testConfig = ReadConfig(Map(
  "uri" -> "mongodb://some.mongo.db",
  "database" -> "testDB",
  "collection" -> "testColl"
))
//Option 1: 'lowest common denominator' case class - works, but leaves the nested struct type value as json that then needs additional parsing
case class stringArray (foo: Option[Seq[String]], bar: Option[String])
val df1 : DataFrame = MongoSpark.load(spark.sparkContext, testConfig).toDF[stringArray]
df1.show()
+--------------------+---------+
| foo| bar|
+--------------------+---------+
|[fooString1, fooS...|barString|
|[{ "uid" : "fooSt...|barString|
+--------------------+---------+
//Option 2: accurate case class - fails with:
//com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(uid,StringType,true)) (value: BsonString{value='fooString1'})
case class fooDoc (uid: Option[String])
case class docArray (foo: Option[Seq[fooDoc]], bar: Option[String])
val df2 : DataFrame = MongoSpark.load(spark.sparkContext, testConfig).toDF[docArray]
//Option 3: map all rows to a json string, then use from_json - why does this return null for 'bar' for the record that doesn't fit the schema?
val mrdd = MongoSpark.load(spark.sparkContext, testConfig)
val jsonRDD = mrdd.map(x => Row(x.toJson()))
val simpleSchema = StructType(Seq(StructField("wholeRecordJson", StringType, true)))
val schema = ScalaReflection.schemaFor[docArray].dataType.asInstanceOf[StructType]
val jsonDF = spark.createDataFrame(jsonRDD, simpleSchema)
val df3 = jsonDF.withColumn("parsed",from_json($"wholeRecordJson", schema))
df3.select("parsed.foo", "parsed.bar").show()
+--------------------+---------+
| foo| bar|
+--------------------+---------+
| null| null|
|[[fooString1], [f...|barString|
+--------------------+---------+
//Desired results:
//desired outcome: only the field not matching the schema (the string-array form of 'foo') is null, while matching columns are populated
+--------------------+---------+
| foo| bar|
+--------------------+---------+
| null|barString|
|[[fooString1], [f...|barString|
+--------------------+---------+

No, there is no easy way to do this, as having merge-incompatible schemas in the same document collection is an anti-pattern, even in Mongo.
There are three main approaches to deal with this:
Fix the data in MongoDB.
Issue a query that "normalizes" the Mongo schema, e.g., one that drops, converts, or renames fields with incompatible types.
Issue separate queries to Mongo for documents of a particular schema type. (Mongo has query operators that can filter based on the type of a field.) Then post-process in Spark and, finally, union the data into a single Spark dataset.
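For the third option, here is a minimal sketch against the test collection above. It reuses testConfig, stringArray and docArray from the question and assumes the connector's MongoRDD.withPipeline is available and that Mongo's $type operator on the first element of foo ("foo.0") can tell the two shapes apart; adjust the $match stages to whatever actually distinguishes your real documents.
import org.bson.Document
import org.apache.spark.sql.functions._

// Documents where 'foo' is an array of sub-documents: read with the accurate case class.
val docsDF = MongoSpark.load(spark.sparkContext, testConfig)
  .withPipeline(Seq(Document.parse("""{ "$match": { "foo.0": { "$type": "object" } } }""")))
  .toDF[docArray]

// Documents where 'foo' is an array of strings: keep 'bar' and null out 'foo'
// so both halves share the docArray schema.
val stringsDF = MongoSpark.load(spark.sparkContext, testConfig)
  .withPipeline(Seq(Document.parse("""{ "$match": { "foo.0": { "$type": "string" } } }""")))
  .toDF[stringArray]
  .select(lit(null).cast("array<struct<uid:string>>").as("foo"), $"bar")

// Union into one DataFrame: 'foo' is null only for the rows that did not match the schema.
val combined = docsDF.union(stringsDF)
combined.show()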

Related

Scala : how to pass variable in a UDF and use in withColumn

I have a variable of type Map[String, Set[String]]:
val metadata = Map(a -> Set(b, c))
val colToUse = "existingcol" // Option[String]
I am trying to add a new column to my DataFrame using metadata and colToUse, which is an existing column in my DataFrame. Each value of metadata is a Set of Strings, and each key is just a String that appears as a value of a column in the df.
e.g.:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name" // student_name is a column name in df
"mike" will be a value of the "student_name" column.
I am trying to add a new column to the existing DF containing the subjects of each student, based on student_name and metadata:
myDF.withColumn("subjects", metadata.getOrElse(col(colToUse), Set.empty))
The above will not work in Scala, as withColumn only accepts Column expressions (and a Map cannot be looked up with a Column key).
I tried using a UDF:
def logic: (Map[String, Set[String]], String) => Set[String] =
  (metadata: Map[String, Set[String]], colToUse: String) => {
    metadata.getOrElse(colToUse, Set("a"))
  }
def myUDF = udf(logic)
def getVal: Column = { myUDF(metadata, col(colToUse.get)) }
and used it in withColumn:
myDF.withColumn("newCol", getVal(metadata, colToUse))
Getting error: Unsupported literal type class scala.Tuple2
Looking for the simplest way to approach this.
Issue 2: in getVal, a Column is expected where metadata is passed, but I am passing a Map.
Is something like this what you need?
val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("mike"))),
  StructType(List(StructField("student_name", StringType)))
)
df.show()
First test dataframe:
+------------+
|student_name|
+------------+
| mike|
+------------+
And now, create the udf that uses the map:
val metadata = Map("mike" -> Set("physics", "chemistry"))
val colToUse = "student_name"
def createUdf =
  udf((key: String) => metadata.getOrElse(key, Set.empty))
and use it in the withColumn function:
df.withColumn("subjects", createUdf(col(colToUse))).show()
it gives:
+------------+--------------------+
|student_name| subjects|
+------------+--------------------+
| mike|[physics, chemistry]|
+------------+--------------------+
am I missing something?
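One side note, not from the original answer: the udf above closes over the driver-side metadata map, which is shipped to the executors with the tasks. If the map is large, broadcasting it keeps a single copy per executor; a minimal sketch under that assumption, reusing spark, df, metadata and colToUse from above:
// Sketch only: broadcast the metadata map so executors share one copy instead of
// re-serializing it with every task closure.
val metadataBc = spark.sparkContext.broadcast(metadata)
def createBroadcastUdf =
  udf((key: String) => metadataBc.value.getOrElse(key, Set.empty[String]))
df.withColumn("subjects", createBroadcastUdf(col(colToUse))).show()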

Loading JSON file with schema in spark is loading null data due to case sensitivity

I am trying to load a JSON file with a schema, but the columns of the schema are all lowercase while the keys in the JSON file are not, so the data is loaded as null.
I am able to load the file with the inferred schema, but that is not an option.
I have also tried setting spark.sql.caseSensitive=true, but it didn't work; it added those keys as new columns instead.
Is there any property that can be set to make this work, or do I have to preprocess all these JSON files before loading them into Spark?
The JSON can have missing key-values.
For example:
{"id": "0001","type": "donut"}
{"Id": "0002","Type": "Cakedonut"}
{"ID": "0002"}
AFAIK there is no built-in setting that can combine your schema; consider it a feature of Spark. You can use the code below to achieve your goal.
import org.apache.spark.sql.{functions, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")
val df = spark.read.json("src/main/resources/test.json")
// group the columns by their lowercased name and coalesce each group into a single column
val finalColumns = df.columns.groupBy(_.toLowerCase)
  .map(t => functions.coalesce(t._2.map(col): _*).as(t._1))
  .toArray
df.select(finalColumns: _*).show()
+---------+----+
| type| id|
+---------+----+
| donut|0001|
|Cakedonut|0002|
| null|0002|
+---------+----+

spark convert dataframe to dataset using case class with option fields

I have the following case class:
case class Person(name: String, lastname: Option[String] = None, age: BigInt) {}
And the following json:
{ "name": "bemjamin", "age" : 1 }
When I try to transform my dataframe into a dataset:
spark.read.json("example.json")
.as[Person].show()
It shows me the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve 'lastname' given input columns: [age, name];
My question is: If my schema is my case class and it defines that the lastname is optional, shouldn't the as() do the conversion?
I can easily fix this using a .map but I would like to know if there is another cleaner alternative to this.
We have one more option to solve the above issue. There are two steps required:
Make sure that fields that can be missing are declared as nullable Scala types (like Option[_]).
Provide a schema argument rather than depending on schema inference. You can, for example, use the Spark SQL Encoder:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
You can update the code as below:
val schema = Encoders.product[Person].schema
val df = spark.read
  .schema(schema)
  .json("/Users/../Desktop/example.json")
  .as[Person]
df.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+
When you perform spark.read.json("example.json").as[Person].show(), it basically reads the dataframe as
FileScan json [age#6L,name#7]
and then tries to apply the Encoder for the Person object, hence the AnalysisException: it is not able to find lastname in your json file.
Either you could hint to Spark that lastname is optional by supplying some data that has a lastname, or try this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema: StructType = ScalaReflection.schemaFor[Person].dataType.asInstanceOf[StructType]
val x = spark.read
  .schema(schema)
  .json("src/main/resources/json/x.json")
  .as[Person]
x.show()
+--------+--------+---+
| name|lastname|age|
+--------+--------+---+
|bemjamin| null| 1|
+--------+--------+---+
Hope it helps.

aggregating jsonarray into Map<key, list> in spark in spark2.x

I am quite new to Spark. I have an input json file which I am reading as
val df = spark.read.json("/Users/user/Desktop/resource.json");
Contents of resource.json looks like this:
{"path":"path1","key":"key1","region":"region1"}
{"path":"path112","key":"key1","region":"region1"}
{"path":"path22","key":"key2","region":"region1"}
Is there any way we can process this dataframe and aggregate the result as
Map<key, List<data>>
where data is each json object in which the key is present?
For ex: expected result is
Map<key1 =[{"path":"path1","key":"key1","region":"region1"}, {"path":"path112","key":"key1","region":"region1"}] ,
key2 = [{"path":"path22","key":"key2","region":"region1"}]>
Any reference/documentation/link to proceed further would be a great help.
Thank you.
Here is what you can do:
import org.json4s._
import org.json4s.jackson.Serialization.read
case class cC(path: String, key: String, region: String)
val df = spark.read.json("/Users/user/Desktop/resource.json");
scala> df.show
+----+-------+-------+
| key| path| region|
+----+-------+-------+
|key1| path1|region1|
|key1|path112|region1|
|key2| path22|region1|
+----+-------+-------+
// Please note that the original json structure is gone. Use .toJSON to get the json back,
// extract the key from each json, and create an RDD[(String, String)], i.e. RDD[(key, json)].
val rdd = df.toJSON.rdd.map(m => {
  implicit val formats = DefaultFormats
  val parsedObj = read[cC](m)
  (parsedObj.key, m)
})
scala> rdd.collect.groupBy(_._1).map(m => (m._1,m._2.map(_._2).toList))
res39: scala.collection.immutable.Map[String,List[String]] = Map(key2 -> List({"key":"key2","path":"path22","region":"region1"}), key1 -> List({"key":"key1","path":"path1","region":"region1"}, {"key":"key1","path":"path112","region":"region1"}))
You can use groupBy with collect_list, which is an aggregation function that collects all matching values into a list per key.
Notice that the original JSON strings are already "gone" (Spark parses them into individual columns), so if you really want a list of all records (with all their columns, including the key), you can use the struct function to combine columns into one column:
import org.apache.spark.sql.functions._
import spark.implicits._
df.groupBy($"key")
.agg(collect_list(struct($"path", $"key", $"region")) as "value")
The result would be:
+----+--------------------------------------------------+
|key |value |
+----+--------------------------------------------------+
|key1|[[path1, key1, region1], [path112, key1, region1]]|
|key2|[[path22, key2, region1]] |
+----+--------------------------------------------------+
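If you actually need the result on the driver as a Map<key, List<json>> as asked in the question, one possible follow-up (a sketch, assuming Spark 2.1+ so that to_json is available) is to serialize each row back to a JSON string before collecting:
import org.apache.spark.sql.functions.{collect_list, struct, to_json}

// Collect each group as json strings, then pull the result back to the driver as a Map.
// Note: this collects everything to the driver, so it only suits modest data sizes.
val resultMap: Map[String, List[String]] =
  df.groupBy($"key")
    .agg(collect_list(to_json(struct($"path", $"key", $"region"))) as "jsons")
    .collect()
    .map(row => row.getString(0) -> row.getSeq[String](1).toList)
    .toMap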

Spark with Kafka streaming save to Elastic search slow performance

I have a list of data; each value is basically a bson document (think json), and each json ranges from 5k to 20k in size. It can either be in bson object format or be converted to json directly:
Key, Value
--------
K1, JSON1
K1, JSON2
K2, JSON3
K2, JSON4
I expect the groupByKey would produce:
K1, (JSON1, JSON2)
K2, (JSON3, JSON4)
so that when I do:
val data = [...].map(x => (x.Key, x.Value))
val groupedData = data.groupByKey()
groupedData.foreachRDD { rdd =>
//the elements in the rdd here are not really grouped by the Key
}
I am confused by the behaviour of the RDD. I read many articles on the internet, including the official Spark programming guide: https://spark.apache.org/docs/0.9.1/scala-programming-guide.html
Still, I couldn't achieve what I want.
-------- UPDATED ---------------------
Basically I really need the data to be grouped by the key; the key is the index to be used in Elasticsearch, so that I can perform batch processing per key via Elasticsearch for Hadoop:
EsSpark.saveToEs(rdd);
I can't do it per partition because Elasticsearch only accepts an RDD. I tried to use sc.makeRDD and sc.parallelize; both tell me it is not serializable.
I tried to use:
EsSpark.saveToEs(rdd, Map(
  "es.resource.write" -> "{TheKeyFromTheObjectAbove}",
  "es.batch.size.bytes" -> "5000000"))
Documentation of the config is here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
But it is VERY slow compared to not using the configuration to define a dynamic index based on the value of each individual document; I suspect it is parsing every json to fetch the value dynamically.
Here is the example.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object Test extends App {
  val session: SparkSession = SparkSession
    .builder.appName("Example")
    .config(new SparkConf().setMaster("local[*]"))
    .getOrCreate()
  val sc = session.sparkContext
  import session.implicits._

  case class Message(key: String, value: String)

  val input: Seq[Message] =
    Seq(Message("K1", "foo1"),
        Message("K1", "foo2"),
        Message("K2", "foo3"),
        Message("K2", "foo4"))
  val inputRdd: RDD[Message] = sc.parallelize(input)
  val intermediate: RDD[(String, String)] =
    inputRdd.map(x => (x.key, x.value))
  intermediate.toDF().show()
  // +---+----+
  // | _1| _2|
  // +---+----+
  // | K1|foo1|
  // | K1|foo2|
  // | K2|foo3|
  // | K2|foo4|
  // +---+----+
  val output: RDD[(String, List[String])] =
    intermediate.groupByKey().map(x => (x._1, x._2.toList))
  output.toDF().show()
  // +---+------------+
  // | _1| _2|
  // +---+------------+
  // | K1|[foo1, foo2]|
  // | K2|[foo3, foo4]|
  // +---+------------+
  output.foreachPartition(partition => if (partition.nonEmpty) {
    println(partition.toList)
  })
  // List((K1,List(foo1, foo2)))
  // List((K2,List(foo3, foo4)))
}
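For completeness, here is a sketch of the same grouping done with the DataFrame API instead of RDD groupByKey. It is meant to sit inside the example above so it can reuse intermediate and the session implicits; note that element order inside each collected list is not guaranteed.
// DataFrame alternative to RDD.groupByKey, reusing 'intermediate' from the example above.
import org.apache.spark.sql.functions.collect_list

val groupedDf = intermediate.toDF("key", "value")
  .groupBy($"key")
  .agg(collect_list($"value").as("values"))
groupedDf.show()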