Deserializing mongodb "$numberLong" in spray-json

I've got the following problem: I have a case class which contains, among other things, a timestamp as a Long. The serialized timestamp in the JSON coming from MongoDB looks like this:
"timestamp" : { "$numberLong" : "1460451019201" }
Can you tell me how to properly deserialize it into my case class using spray-json?
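One way to handle this is a custom RootJsonFormat that unwraps the $numberLong object. A minimal sketch, assuming a hypothetical case class with a name and a timestamp field (adapt the names to your own class):

import spray.json._

// Hypothetical case class standing in for the one in the question.
case class MyRecord(name: String, timestamp: Long)

object MyRecordProtocol extends DefaultJsonProtocol {

  // Accepts a timestamp either as a plain JSON number or wrapped in
  // Mongo extended JSON's {"$numberLong": "..."} object.
  private def readTimestamp(js: JsValue): Long = js match {
    case JsNumber(n) => n.toLong
    case JsObject(fields) =>
      fields.get("$numberLong") match {
        case Some(JsString(s)) => s.toLong
        case Some(JsNumber(n)) => n.toLong
        case _                 => deserializationError(s"Expected $$numberLong in $js")
      }
    case other => deserializationError(s"Expected a timestamp, got $other")
  }

  implicit object MyRecordFormat extends RootJsonFormat[MyRecord] {
    def read(json: JsValue): MyRecord = {
      val fields = json.asJsObject.fields
      MyRecord(
        name      = fields("name").convertTo[String],
        timestamp = readTimestamp(fields("timestamp"))
      )
    }
    def write(r: MyRecord): JsValue =
      JsObject("name" -> JsString(r.name), "timestamp" -> JsNumber(r.timestamp))
  }
}

With import MyRecordProtocol._ in scope, jsonString.parseJson.convertTo[MyRecord] should then handle both the wrapped and the plain-number timestamp forms.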

Related

Does ReactiveMongo handle extended JSON to BSON conversion fully?

I have been trying to use reactivemongo to insert some documents into a mongodb collection with a few BSON types.
I am using the Play JSON library to parse and manipulate some documents in extended JSON; here is one example:
{
  "_id" : {"$oid": "5f3403dc7e562db8e0aced6b"},
  "some_datetime" : {
    "$date" : 1597841586927
  }
}
I'm using reactivemongo-play-json, and so I have to import the following so my JsObject is automatically cast to a reactivemongo BSONDocument when passing it to collection.insert.one:
import reactivemongo.play.json.compat._
import json2bson._
Unfortunately, once I open my mongo shell and look at the document I just inserted, this is the result:
{
  "_id" : ObjectId("5f3403dc7e562db8e0aced6b"),
  "some_datetime" : {
    "$date" : NumberLong("1597244282116")
  },
}
Only the _id has been understood as a BSON type described using extended JSON. I'd expect the some_datetime field to be something like an ISODate(), just as I'd expect to see UUID()-type values instead of their extended JSON description, which looks like this:
{'$binary': 'oKQrIfWuTI6JpPbPlYGYEQ==', '$type': '04'}
How can I make sure this extended JSON is actually converted to proper BSON types?
It turns out the problem is that what I thought was extended JSON actually is not; my datetime should be formatted as:
{"$date": {"$numberLong": "1597841586927"}}
instead of
{"$date": 1597841586927}
The wrong format was introduced by my data source: a Kafka Connect mongo source connector that does not serialize documents to proper extended JSON by default (see this stackoverflow post).
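If the data source cannot be changed, another option is to normalize the documents before inserting them. This is only an illustrative sketch (the function name is made up, and it assumes the shorthand only ever appears as {"$date": <millis>}):

import play.api.libs.json._

// Illustrative sketch: rewrite the shorthand {"$date": <millis>} into canonical
// extended JSON {"$date": {"$numberLong": "<millis>"}} everywhere in the document.
def canonicalizeDates(js: JsValue): JsValue = js match {
  case obj: JsObject =>
    JsObject(obj.fields.toSeq.map {
      case ("$date", JsNumber(millis)) =>
        "$date" -> Json.obj("$numberLong" -> millis.toBigInt.toString)
      case (key, value) => key -> canonicalizeDates(value)
    })
  case arr: JsArray => JsArray(arr.value.map(canonicalizeDates))
  case other        => other
}

Running each JsObject through such a pass before handing it to json2bson should give the driver the canonical extended JSON it understands.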

Transforming ReactiveMongo JSON with Play JSON

ReactiveMongo's JSON functionality generates objects (JsObject in play-json parlance) rather than scalars for certain MongoDB datatypes like BSONObjectID and BSONDateTime. For example, you get JSON like this:
{
  "_id" : {
    "$oid" : "5de32e618f02001d8d521757" //BSONObjectID
  },
  "createdAt" : {
    "$date" : 15751396221447 //BSONDateTime
  }
}
Aside from this being cumbersome to deal with, I would prefer not to expose JSON that leaks MongoDB concerns to REST clients.
The tricky thing is that these values occur throughout the tree, so I need to write a Play JSON transformer smart enough to recursively transform the above at every level to look like this:
{
  "id" : "5de32e618f02001d8d521757",
  "createdAt" : 15751396221447
}
One failed attempt to do this for just BSONObjectID is this:
(JsPath \ "_id").json.update(
JsPath.read[JsObject].map{ o => o ++ Json.obj( "id" -> (o \ f"$$oid").as[String]) }
)
How can I do this?
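One possible direction, sketched here under the assumption that the only wrappers to strip are $oid and $date and that every _id key should become id: skip JsPath transformers and rewrite the tree with a plain recursive function.

import play.api.libs.json._

// Illustrative sketch: unwrap {"$oid": "..."} and {"$date": ...} wherever they
// occur, renaming "_id" keys to "id" along the way.
def flattenBson(js: JsValue): JsValue = js match {
  case obj: JsObject if obj.keys == Set("$oid")  => (obj \ "$oid").get
  case obj: JsObject if obj.keys == Set("$date") => flattenBson((obj \ "$date").get)
  case obj: JsObject =>
    JsObject(obj.fields.toSeq.map {
      case ("_id", value) => "id" -> flattenBson(value)
      case (key, value)   => key -> flattenBson(value)
    })
  case arr: JsArray => JsArray(arr.value.map(flattenBson))
  case other        => other
}

Applied to the document above this yields the flattened shape shown, and it recurses through nested objects and arrays at any depth.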

Issue while parsing mongo collection which has few schemas in spark

I'm moving data from one collection to another collection in a different cluster using Spark. The data's schema is not consistent (I mean that a single collection holds several schemas, with different data types and small variations). When I try to read the data from Spark, the sampling is unable to capture all the schemas of the data and throws the error below. (I have a complex schema which I can't specify explicitly, so Spark has to infer it by sampling.)
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a NullType (value: BsonArray{values=[{ "type" : "GUEST_FEE", "appliesPer" : "GUEST_PER_NIGHT", "description" : null, "minAmount" : 33, "maxAmount" : 33 }]})
I tried reading the collection as an RDD and writing it as an RDD, but the issue persists.
Any help on this?
Thanks.
All these com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast SOME_TYPE into a NullType errors come from incorrect schema inference. For schema-less data sources such as JSON files or MongoDB, Spark scans a small fraction of the data to determine the types. If some particular field has lots of NULLs, you can get unlucky and the type will be set to NullType.
One thing you can do is increase the number of entries scanned for schema inference.
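For example, the MongoDB Spark connector exposes a sampleSize read option for this; the exact option name can differ between connector versions, so treat this Scala sketch as something to verify against your version:

// Sketch: ask the connector to sample more documents during schema inference.
val df = sqlContext.read
  .format("com.mongodb.spark.sql")
  .option("uri", "mongodb://localhost:27017/yourdb.yourcollection") // hypothetical URI
  .option("sampleSize", 100000L)                                    // default is much smaller
  .load()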
Another is to get the inferred schema first, fix it, and reload the dataframe with the fixed schema:
import pyspark.sql.types

# Replace NullType fields (mis-inferred from all-null samples) with StringType
def fix_spark_schema(schema):
    if schema.__class__ == pyspark.sql.types.StructType:
        return pyspark.sql.types.StructType([fix_spark_schema(f) for f in schema.fields])
    if schema.__class__ == pyspark.sql.types.StructField:
        return pyspark.sql.types.StructField(schema.name, fix_spark_schema(schema.dataType), schema.nullable)
    if schema.__class__ == pyspark.sql.types.NullType:
        return pyspark.sql.types.StringType()
    return schema

collection_schema = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load() \
    .schema

collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load(schema=fix_spark_schema(collection_schema))
In my case, all problematic fields could be represented with StringType; you might make the logic more complex if needed.
As far as I understood your problem:
* either Spark incorrectly detected your schema and considered some fields required (nullable = false); in that case you can still define the schema explicitly and set nullable to true. This works if your schema has been evolving and at some point in the past you added or removed a field while keeping the column type (e.g. a String is always a String and never a Struct or some other completely different type),
* or your schemas are completely inconsistent, i.e. a String field turned at some point into a Struct or some other completely different type. In that case I don't see another solution than using the RDD abstraction, working with very permissive types like Any in Scala (Object in Java), and using isInstanceOf tests (or pattern matching) to normalize all fields into one common format, as in the sketch after this list.
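A minimal sketch of that RDD-level normalization, assuming a hypothetical inconsistent field named fees and choosing a JSON string as the common format:

import com.mongodb.spark._
import org.apache.spark.rdd.RDD
import org.bson.Document

// sc is an existing SparkContext configured with the source collection's URI.
val docs: RDD[Document] = MongoSpark.load(sc)

// Normalize the inconsistent "fees" field into one common representation.
val normalizedFees: RDD[String] = docs.map { doc =>
  doc.get("fees") match {
    case null        => ""              // field missing or explicitly null
    case s: String   => s               // scalar variant
    case d: Document => d.toJson        // struct variant, kept as JSON text
    case other       => other.toString  // anything else: fall back to its string form
  }
}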
Actually, I also see another possible solution, but only if you know which data has which schema. For instance, if you know that data between 2018-01-01 and 2018-02-01 uses schema#1 and everything else uses schema#2, you can write a pipeline that transforms schema#1 into schema#2. Afterwards you can simply union both datasets and apply your transformations to consistently structured values.
Edit:
I've just tried code similar to yours and it worked correctly against my local MongoDB instance:
val sc = getSparkContext(Array("mongodb://localhost:27017/test.init_data"))
// Load sample data
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.bson.Document

val docFees =
  """
    | {"fees": null}
    | {"fees": { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ]} }
  """.stripMargin.trim.split("[\\r\\n]+").toSeq
MongoSpark.save(sc.parallelize(docFees.map(Document.parse)))

val rdd = MongoSpark.load(sc)
rdd.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://localhost:27017/test.new_coll_data", "replaceDocument" -> "true")))
And when I checked the result in the MongoDB shell, I got:
> coll = db.init_data;
test.init_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
> coll = db.new_coll_data;
test.new_coll_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }

mongodb pojo codec conversion missing id field

I have the following document structure in mongo:
{
  "_id" : 4771902,
  "upc" : "test-upc-v1",
  "reportingCategory" : {
    "id" : 14,
    "department" : "Footwear"
  }
}
My Java classes look like:
public class Product {
    private Long _id;
    private String upc;
    private ReportingCategory reportingCategory;
}

public class ReportingCategory {
    private Long id;
    private String department;
}
I am using the mongo POJO codec for conversion. The "id" field under ReportingCategory is being returned as null.
All other data is available. I can see the value when I convert the document into a RawBsonDocument, but it seems to get lost in the POJO conversion.
The "id" field has no index on it and is not used to uniquely identify this document.
Has anyone faced something similar, and is there any workaround for it?
P.S. I am using Mongo 3.6 with the 3.6 async driver.
This is indeed a feature/bug in the MongoDB Java driver.
Anyone looking for the reason and a workaround can find them here: https://jira.mongodb.org/browse/JAVA-2750

How to define a json object type in a JacksonDB Mapper for MongoDB?

I am building a Play framework Java application with mongodb as the backend.
I am using Jackson DB Mapper to define the model.
One of the models needs to have a field called "filters" with the following value in MongoDB:
"filters" : {
"nutrients" : "fiber",
"course" : "starter",
"cuisine" : "indian",
"locale" : "india",
"specialdiet" : "spicy"
}
How do I define this object with the Jackson DB mapper?
Please forgive my jargon, since I am new to the Jackson DB mapper.