ReactiveMongo's JSON functionality serializes certain MongoDB datatypes, such as BSONObjectID and BSONDateTime, as objects (JsObject in play-json parlance) rather than scalars. For example, you get JSON like this:
{
  "_id" : {
    "$oid" : "5de32e618f02001d8d521757" // BSONObjectID
  },
  "createdAt" : {
    "$date" : 15751396221447 // BSONDateTime
  }
}
Aside from being cumbersome to deal with, this JSON leaks MongoDB concerns that I would prefer not to expose to REST clients.
The tricky thing is that these values occur throughout the tree, so I need to write a Play JSON transformer smart enough to recursively transform the above at every level to look like this:
{
  "id" : "5de32e618f02001d8d521757",
  "createdAt" : 15751396221447
}
One failed attempt to do this for just BSONObjectID is this:
(JsPath \ "_id").json.update(
  JsPath.read[JsObject].map { o => o ++ Json.obj("id" -> (o \ f"$$oid").as[String]) }
)
How can I do this?
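For illustration, one shape the rewrite could take is a plain recursive function over JsValue instead of a JsPath transformer (a minimal sketch; unwrapBson is a placeholder name):

import play.api.libs.json._

// Walk the whole tree: unwrap {"$oid": ...} (renaming "_id" to "id"),
// unwrap {"$date": ...} in place, and recurse into nested objects and arrays.
def unwrapBson(json: JsValue): JsValue = json match {
  case obj: JsObject =>
    JsObject(obj.fields.map {
      case ("_id", inner: JsObject) if inner.keys.contains("$oid") =>
        "id" -> inner.value("$oid")
      case (key, inner: JsObject) if inner.keys.contains("$date") =>
        key -> inner.value("$date")
      case (key, value) => key -> unwrapBson(value)
    })
  case arr: JsArray => JsArray(arr.value.map(unwrapBson))
  case other => other
}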
I'm moving data from one collection to another collection in a different cluster using Spark. The data's schema is not consistent (a single collection contains several schemas whose data types differ in small ways). When I try to read the data with Spark, the sampling is unable to capture all the schemas in the data and throws the error below. (The schema is complex, so I can't specify it explicitly; I rely on the one Spark infers by sampling.)
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a NullType (value: BsonArray{values=[{ "type" : "GUEST_FEE", "appliesPer" : "GUEST_PER_NIGHT", "description" : null, "minAmount" : 33, "maxAmount" : 33 }]})
I tried reading the collection as an RDD and writing it as an RDD, but the issue persists.
Any help on this? Thanks.
All these com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast SOME_TYPE into a NullType errors come from incorrect schema inference. For schema-less data sources such as JSON files or MongoDB, Spark scans a small fraction of the data to determine the types. If a particular field has lots of NULLs, you can get unlucky, and its type will be set to NullType.
One thing you can do is increase the number of entries scanned for schema inference.
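With the MongoDB Spark connector the sampling size is controlled by the sampleSize read option (it defaults to 1000 documents). A minimal sketch, with the connection options omitted:

# Scan 100k documents instead of the default 1000 when inferring the schema
df = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .option("sampleSize", 100000) \
    .load()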
Another option is to get the inferred schema first, fix it, and reload the DataFrame with the fixed schema:
import pyspark.sql.types

# Recursively replace every NullType in the inferred schema with StringType
def fix_spark_schema(schema):
    if schema.__class__ == pyspark.sql.types.StructType:
        return pyspark.sql.types.StructType([fix_spark_schema(f) for f in schema.fields])
    if schema.__class__ == pyspark.sql.types.StructField:
        return pyspark.sql.types.StructField(schema.name, fix_spark_schema(schema.dataType), schema.nullable)
    if schema.__class__ == pyspark.sql.types.NullType:
        return pyspark.sql.types.StringType()
    return schema
collection_schema = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load() \
    .schema

collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load(schema=fix_spark_schema(collection_schema))
In my case all problematic fields could be represented with StringType; you can make the logic more complex if needed.
As far as I understood your problem:
* either Spark detected your schema incorrectly and considered some fields required (nullable = false); in that case, you can define the schema explicitly and set nullable to true. This works if your schema evolved by adding or removing fields at some point in the past while keeping each column's type (e.g. a String stays a String and does not become a Struct or some other completely different type)
* or your schemas are completely inconsistent, i.e. a String field turned at some point into a Struct or some other completely different type. In that case I see no solution other than using the RDD abstraction, working with very permissive types such as Any in Scala (Object in Java), and using isInstanceOf tests to normalize all fields into one common format (see the sketch right after this list)
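A rough sketch of that permissive approach, reusing the "fees" field from the error above (sc is your SparkContext; the normalization rules themselves are up to you):

import com.mongodb.spark._
import org.bson.Document

// Load raw BSON documents and coerce the inconsistent "fees" field by hand
// so that every document ends up with one common shape.
val normalized = MongoSpark.load(sc).map { doc =>
  doc.get("fees") match {
    case _: Document => doc                                          // already the expected shape
    case null        => doc.put("fees", new Document()); doc         // missing or null: empty subdocument
    case other       => doc.put("fees", new Document("raw", other.toString)); doc
  }
}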
Actually, I also see another possible solution, but only if you know which data has which schema. For instance, if you know that data between 2018-01-01 and 2018-02-01 uses schema#1 and everything else uses schema#2, you can write a pipeline that transforms schema#1 into schema#2. Afterwards you can simply union both datasets and apply your transformations to consistently structured values.
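A rough sketch of that idea (df, the createdAt column, and the boundary date are placeholders; unionByName needs Spark 2.3+):

import org.apache.spark.sql.functions.col

// Hypothetical split: rows written before 2018-02-01 carry schema#1, the
// rest schema#2. Normalize the divergent field, then union the two halves.
val schema1 = df.filter(col("createdAt") < "2018-02-01")
val schema2 = df.filter(col("createdAt") >= "2018-02-01")

val unified = schema1.withColumn("fees", col("fees").cast("string"))
  .unionByName(schema2.withColumn("fees", col("fees").cast("string")))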
Edit:
I've just tried code similar to yours and it worked correctly against my local MongoDB instance:
val sc = getSparkContext(Array("mongodb://localhost:27017/test.init_data"))

// Load sample data
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.bson.Document

val docFees =
  """
    | {"fees": null}
    | {"fees": { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ]} }
  """.stripMargin.trim.split("[\\r\\n]+").toSeq
MongoSpark.save(sc.parallelize(docFees.map(Document.parse)))

val rdd = MongoSpark.load(sc)
rdd.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://localhost:27017/test.new_coll_data", "replaceDocument" -> "true")))
And when I checked the result in the MongoDB shell, I got:
> coll = db.init_data;
test.init_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
> coll = db.new_coll_data;
test.new_coll_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
I've got the following problem: I have a case class which contains, among other things, a timestamp as a Long. Serialized to JSON by MongoDB, the timestamp looks like this:
"timestamp" : { "$numberLong" : "1460451019201" }
Can you tell me how to properly deserialize it into my case class using spray-json?
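One way to handle this is a custom JsonFormat[Long] that accepts both the {"$numberLong": ...} wrapper and a bare number (a sketch; MongoLongFormat is a made-up name, and wiring it in over DefaultJsonProtocol's built-in Long format is left to your protocol setup):

import spray.json._

// Reads a Long from either {"$numberLong": "1460451019201"} or a bare
// number; always writes a bare number back out.
object MongoLongFormat extends JsonFormat[Long] {
  def read(json: JsValue): Long = json match {
    case JsObject(fields) => fields.get("$numberLong") match {
      case Some(JsString(s)) => s.toLong
      case Some(JsNumber(n)) => n.toLong
      case _ => deserializationError("Expected a $numberLong wrapper")
    }
    case JsNumber(n) => n.toLong
    case other       => deserializationError("Cannot read a Long from " + other)
  }
  def write(value: Long): JsValue = JsNumber(value)
}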
I have contact documents which contain an embedded document "lead". So the data would look like this:
{
  "_id" : ObjectId("54f8fa496d6163ad64010000"),
  "name" : "teretyrrtuyytiuyi",
  "email" : "rertytruyy#fdgjioj.com",
  "fax" : "",
  "birth_date" : null,
  "phone" : "dfgdfhfgjg",
  "phone_2" : "hgjhkhjkljlj",
  "lead" : { "_id" : ObjectId("54f8fa496d6163ad64020000"), "appointment status" : "dfhgfgjghjk" }
}
When there are many contacts, there will be many leads. I want to retrieve all the leads in the collection without retrieving the owning contacts. I tried the following, but neither works:
db.lead.find()
db.contacts.find({ 'lead.$' : 1})
Any way to do this?
If that query makes sense for you, you should probably have used a different data structure. If your embedded document has an id, it is almost certainly supposed to be a first-class citizen instead.
You can work around this using the aggregation framework, but I'd consider that a hack that probably works around some more profound problem with your data model.
It's also not very elegant:
db.contacts.aggregate({ $project : {
    "appointment_status" : "$lead.appointment status",
    "lead_id" : "$lead._id", ... } });
That way, it'll look as if leads were a collection of its own, but it isn't, and this is just a bad hack around that fact.
Note that there's no wildcard operator, so if you want to have all fields projected to the root level, you'll have to do it manually. It'd be much easier to simply read the regular documents - if that's not what you need, correct your schema design.
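For what it's worth, on newer servers (MongoDB 3.4+) $replaceRoot can promote the embedded document without listing every field, though the schema advice above still stands. A sketch:

db.contacts.aggregate([
  { $match : { "lead" : { $type : "object" } } },
  { $replaceRoot : { newRoot : "$lead" } }
]);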
I know this has been covered quite a lot on here; however, I'm very new to MongoDB and am struggling to apply the answers I've found to my situation.
In short, I have two collections: 'total_by_country_and_isrc', which is the output of a MapReduce function, and 'asset_report', which contains an asset_id that is not present in 'total_by_country_and_isrc' or in the original raw collection it was MapReduced from.
An example of the data in 'total_by_country_and_isrc' is:
{ "_id" : { "custom_id" : 4748532, "isrc" : "GBCEJ0100080", "country" : "AE" }, "value" : 0 }
And an example of the data in the 'asset_report' is:
{ "_id" : ObjectId("51824ef016f3edbb14ef5eae"), "Asset ID" : "A836656134476364", "Asset Type" : "Web", "Metadata Origination" : "Unknown", "Custom ID" : "4748532", "ISRC" : "" }
I'd like to end up with the following ('total_by_country_and_isrc_with_asset_id'):
{ "_id" : { "Asset ID" : "A836656134476364", "custom_id" : 4748532, "isrc" : "GBCEJ0100080", "country" : "AE" }, "value" : 0 }
I know how I would approach this in a relational database, but I really want to get this working in Mongo, as I'm dealing with some pretty large collections and feel Mongo is the right tool for the job.
Can anyone offer some guidance here?
I think you want to use the "reduce" output action: Output to a Collection with an Action. You'll need to regenerate total_by_country_and_isrc, because it doesn't look like asset_report has the fields it needs to generate the keys you already have in total_by_country_and_isrc – so "joining" the data is impossible.
First, write a map method that is capable of generating the same keys from the original collection (used to generate total_by_country_and_isrc) and also from the asset_report collection. Think of these keys as the "join" fields.
Next, map and reduce your original collection to create total_by_country_and_isrc with the correct keys.
Finally, map asset_report with that same method, and use a reduce function that merges the mapped asset_report data into the data already in total_by_country_and_isrc wherever the keys intersect.
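A sketch of what that last step could look like in the shell (the map and reduce bodies are illustrative placeholders, not drop-in code):

db.asset_report.mapReduce(
  function () {
    // must emit the same composite key shape the original job produced
    emit({ custom_id : parseInt(this["Custom ID"]) /* plus isrc, country */ },
         { "Asset ID" : this["Asset ID"] });
  },
  function (key, values) {
    // with out: { reduce: ... }, this merges each newly mapped value with
    // the value already stored in the output collection under the same key
    var merged = {};
    values.forEach(function (v) {
      if (typeof v === "object") { for (var k in v) merged[k] = v[k]; }
      else { merged.value = v; }
    });
    return merged;
  },
  { out : { reduce : "total_by_country_and_isrc" } }
);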
I would like to construct a query that returns just the names of the classes for the data structure below.
So far, the closest I've come is using dot notation
db.mycoll.find({name:"game1"},{"classes.1.name":true})
But the problem with this approach is that it will only return the name of the first class. Please help me get the names of all three classes.
I wish I could use a wildcard as below, but I'm not sure anything like that exists.
db.mycoll.find({name:"game1"},{"classes.$*.name":true})
Data structure:
{
  "name" : "game1",
  "classes" : {
    "1" : {
      "name" : "warlock",
      "version" : "1.0"
    },
    "2" : {
      "name" : "shaman",
      "version" : "2.0"
    },
    "3" : {
      "name" : "mage",
      "version" : "1.0"
    }
  }
}
There is no simple query that will achieve the results you seek. MongoDB has limited support for querying against sub-objects or arrays of objects. The basic premise with MongoDB is that you are querying for the top-level document.
That said, things are changing and you still have some options:
* Use the new MongoDB Aggregation Framework. It has a $project operation that should do what you're looking for (see the sketch after these options).
* You can return just the classes field and then merge the names together client-side. This should be trivial in most languages.
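On modern servers (MongoDB 3.4+), $objectToArray gives the aggregation route a concrete shape. A sketch against the structure above:

db.mycoll.aggregate([
  { $match : { name : "game1" } },
  { $project : {
      _id : 0,
      names : {
        $map : {
          input : { $objectToArray : "$classes" },
          as : "c",
          in : "$$c.v.name"
        }
      }
  } }
]);
// returns { "names" : [ "warlock", "shaman", "mage" ] }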
Note that it's not clear what you're doing with classes. Is it an array or an object? If it's an object, what does classes.1 actually represent? Is it different from classes.warlock?