Presto & MongoDB - Schema creation and updates

I have Presto set up locally and can query data from MongoDB collections. At the start, I created a presto_schema collection in MongoDB so that Presto could understand the collection details it needs for querying, and I added one collection entry to it. However, I later noticed that any new MongoDB collection that was never added to presto_schema is still accessible from Presto, and on the first query the new collection's details are automatically appended to the presto_schema collection with the relevant schema.
For collections with a nested schema, however, it does not automatically add all the nested fields; it only adds the ones it identifies from the initial query.
For example, suppose my newly created collection (new_collection) contains the following document:
{
  "_id" : "13ec5e2a-ef04-4d05-b971-ef8e65638f83",
  "name" : "npt",
  "client" : "npt_client",
  "attributes" : {
    "level" : 697,
    "country" : "SC",
    "doy" : 2022
  }
}
And say my first query from Presto is:
presto:mydb> select count(*) from new_collection where attributes.level > 200;
presto_schema automatically gets a new entry for this collection, and it includes all the non-nested fields plus the nested fields that appear in the initial query, but it fails to add the remaining nested fields. As a result, Presto does not recognize queries on those other nested fields. I could go ahead and amend presto_schema with all the missing nested fields, but I am wondering whether there is an automated way, so that we don't need to keep amending it manually for every new field added to the collection (consider a scenario with completely dynamic fields being added to the collection's nested object).

I would recommend upgrading to Trino (formerly PrestoSQL) because the MongoDB connector (version >= 360) supports mapping fields to JSON type. This type mapping is unavailable in prestodb.
https://trino.io/download.html
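Once on Trino, one way to use that mapping is to edit the schema collection entry so the whole nested object is exposed as JSON rather than as an enumerated ROW. A rough sketch in the mongo shell, assuming your schema collection is still named presto_schema and follows the connector's usual { table, fields: [ { name, type, hidden } ] } layout (check your actual entry before changing anything):

db.presto_schema.updateOne(
  { "table" : "new_collection" },
  { $set : { "fields.$[f].type" : "json" } },
  { arrayFilters : [ { "f.name" : "attributes" } ] }
)

The idea is that, with attributes typed as json, newly added nested fields stay reachable through Trino's JSON functions instead of each needing its own schema entry.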

Related

How to autogenerate new _id in target table when migrating a DocumentDB table using AWS DMS to another DocumentDB table

I have an AWS DocumentDB with schema my-schema and a table called my-table, which has a structure something like:
{
  "_id": { "FIELD_1" : "001", "FIELD_2" : "A1" },
  "FIELD_1": "001",
  "FIELD_2": "A1",
  ...
}
As you can see, the _id contains FIELD_1 & FIELD_2; the combination of these two fields is unique across all records. These two fields formed the composite primary key in the original Oracle DB, which is why AWS DMS chose to put them into _id when we migrated from Oracle to DocumentDB.
Now the problem is that we need _id to be a MongoDB ObjectId instead of a nested JSON document.
What I have tried:
1. Create a source endpoint with my DocumentDB (which contains this bad _id data, in schema my-schema).
2. Create a target endpoint with the same DocumentDB but with a new schema my-new-schema and the same table name my-table.
3. Migrate the data from my-schema to my-new-schema using transformations (remove column _id).
But it still replicates the same nested _id into the target table.
I have tried both document metadata mode & table metadata mode.
In table metadata mode, it doesn't even transfer data: after it flattens the _id into _id.FIELD_1 & _id.FIELD_2, DMS throws the exception "Document can't have '.' in field names".
I know that I can do this easily in code, but if it is somehow possible to achieve my goal using DMS, I would prefer that.
Or can we achieve this using MongoDB commands directly?
Not sure about DMS, but I think you can do this using an aggregation pipeline with the $out stage. Project the fields you need and exclude _id; the documents in the new collection will be inserted with the usual ObjectId. Something like this:
db.collection.aggregate([
  { $project: {
      _id: 0,
      FIELD_1: 1,
      FIELD_2: 1
  }},
  { $out: 'new_collection' }
])
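If it helps to sanity-check the result, every document written by $out should come back with a freshly generated ObjectId (the value in the comment below is purely illustrative):

db.new_collection.findOne()
// e.g. { "_id" : ObjectId("64a1f0c2e4b0a1b2c3d4e5f6"), "FIELD_1" : "001", "FIELD_2" : "A1" }

Note that $out replaces the target collection if it already exists, so run it against an empty or throwaway collection name first.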

Updating mongoData with MongoSpark

From the following tutorial provided by Mongo:
MongoSpark.save(centenarians.write.option("collection", "hundredClub").mode("overwrite"))
Am I correct in understanding that what is essentially happening is that Mongo first drops the collection and then overwrites that collection with the new data?
My question, then, is whether it is possible to use the MongoSpark connector to actually update records in Mongo.
Let's say I've got data that looks like:
{"_id" : ObjectId(12345), "name" : "John" , "Occupation" : "Baker"}
What I would then like to do is merge in the record of the same person from another file that has more details, i.e. that file looks like:
{"name" : "John", "address" : "1800 some street"}
The goal is to update the record in Mongo so that the JSON looks like:
{"_id" : ObjectId(12345) "name" : "John" , "address" : 1800 some street", "Occupation" : "Baker"}
Now here's the thing: let's assume that we just want to update John, and that there are millions of other records that we would like to leave as is.
There are a few questions here, I'll try to break them down.
What is essentially happening is that Mongo is first dropping the collection, and then its overwritting that collection with the new data?
Correct. As of mongo-spark v2.x, if you specify mode overwrite, the MongoDB Connector for Spark will first drop the collection and then save the new result into the collection. See the source snippet for more information.
My question is then is it possible to use the MongoSpark connector to actually update records in Mongo,
The patch described in SPARK-66 (mongo-spark v1.1+) is: if a dataframe contains an _id field, the data will be upserted. This means any existing documents with the same _id value will be updated, and new documents whose _id value does not already exist in the collection will be inserted.
What I would then like to do is to merge the record of the person from another file that has more details
As mentioned above, you need to know the _id value from your collection. Example steps:
1. Create a dataframe (A) by reading from your Person collection to retrieve John's _id value, i.e. ObjectId(12345).
2. Merge the _id value ObjectId(12345) into your dataframe (B, from the other file with more information). Use a unique field value to join the two dataframes (A and B).
3. Save the merged dataframe (C) without specifying overwrite mode.
we just want to update John, and that there are millions of other records that we would like to leave as is.
In that case, before you merge the two dataframes, filter out any unwanted records from dataframe B (the one from the other file with more details). In addition, when you call save(), specify mode append.
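For what it's worth, the per-row write the connector ends up issuing when _id is present is roughly a replace-with-upsert. You can picture it in mongo-shell terms like this (a sketch of the semantics only, not something you would run from Spark; the collection name comes from the tutorial snippet above and the ObjectId is a made-up placeholder):

db.hundredClub.replaceOne(
  { "_id" : ObjectId("507f191e810c19729de860ea") },   // placeholder _id taken from dataframe A
  { "name" : "John", "address" : "1800 some street", "Occupation" : "Baker" },
  { upsert : true }
)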

In MongoDB find out when last query to a collection was? (Removing stale collections)

I would like to find out how old/stale a collection is. I was wondering if there is a way to know when the last query against a collection was made, or even to get a list of all collections with their last access dates.
If your MongoDB collection's document _id is of the format "_id" : ObjectId("57bee0cbc9735bf0b80c23e0"), then MongoDB stores the document creation timestamp inside it.
This can be retrieved by executing the following query:
db.newcollection.findOne({"_id" : ObjectId("57bee0cbc9735bf0b80c23e0")})._id.getTimestamp();
The result would be an ISODate like ISODate("2016-08-25T12:12:59Z").
find out how old/stale a collection
There is no built-in facility in MongoDB for tracking how recently a collection was accessed, but it is doable by maintaining a log in which we record an entry whenever we access a collection.
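A minimal sketch of that idea from the mongo shell, assuming a helper collection named collection_access_log and a wrapper you call instead of find() directly (both names are ours, not MongoDB features):

function findWithAccessLog(collName, query) {
    // record when this collection was last read through the wrapper
    db.collection_access_log.updateOne(
        { _id : collName },
        { $set : { lastAccessed : new Date() } },
        { upsert : true }
    );
    return db.getCollection(collName).find(query);
}

findWithAccessLog("newcollection", {})

The log only sees traffic that goes through the wrapper, so it is a convention rather than an enforcement mechanism.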
References
ObjectID.getTimestamp()
Log messages
Rotate Log files
db.collection.stats()

How to compare all records of two collections in MongoDB using mapReduce?

I have a use case in which I want to compare each record of two collections in MongoDB, and after comparing each record I need to find the mismatched fields for every record.
Let us take an example: in collection1 I have a record {id : 1, name : "bks"},
and in collection2 I have a record {id : 1, name : "abc"}.
When I compare these two records with the same key, the field name is a mismatched field, since its values differ.
I am thinking of implementing this use case with mapReduce in MongoDB, but I am facing problems accessing the other collection's name in the map function. When I tried to do the comparison in the map function, I got the error: "errmsg" : "exception: ReferenceError: db is not defined near '
Can anyone give me some thoughts on how to compare records using mapreduce?
It might have helped to read the documentation:
When upgrading to MongoDB 2.4, you will need to refactor your code if your map-reduce operations, group commands, or $where operator expressions include any global shell functions or properties that are no longer available, such as db.
So from your error fragment, you appear to be referencing db in order to access another collection. You cannot do that.
If indeed you are intending to "compare" items in one collection to those in another, then there is no approach other than looping code:
db.collection.find().forEach(function(doc) {
    var another = db.anothercollection.findOne({ "_id": doc._id });
    // Code to compare
})
There is simply no concept of "joins" as such available to MongoDB, and operations such as mapReduce or aggregate or others strictly work with one collection only.
The exception is db.eval(), but as per all the stern warnings in the documentation, this is almost always a very bad idea.
Live with your comparison in looping code.
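To make the looping approach concrete for the collections in the question, here is one way the "code to compare" part might look (a sketch only, assuming flat documents that share the same _id in both collections):

db.collection1.find().forEach(function(doc) {
    var other = db.collection2.findOne({ "_id" : doc._id });
    if (other === null) {
        print(doc._id + ": missing in collection2");
        return;
    }
    Object.keys(doc).forEach(function(key) {
        if (key === "_id") return;                      // already matched on _id
        if (doc[key] !== other[key]) {                  // simple strict comparison, fine for scalar fields
            print(doc._id + ": mismatch in field '" + key + "'");
        }
    });
})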

MongoDB - DBRef to a DBObject

Using Java ... not that it matters.
Having a problem and maybe it is just a design issue.
I assign "_id" field to all of my documents, even embedded ones.
I have a parent document ( and the collection for those ) which has an embedded document
So I have something like:
{ "_id" : "49902cde5162504500b45c2c" ,
"name" : "MongoDB" ,
"type" : "database" ,
"count" : 1 ,
"info" : { "_id" : "49902cde5162504500b45c2y",
"x" : 203 ,
"y" : 102
}
}
Now I want to have another document which references my "info" via a DBRef; I don't want a copy. So I create a DBRef which points to the collection of the parent document and specifies the _id as xxxx5c2y. However, calling fetch() on the DBRef gives NULL.
Does this mean that DBRef and fetch() only work on top-level collection entries' "_id" fields?
I would have expected fetch() to consider all key:value pairs within the braces of the document, but maybe that is asking too much. Does anyone know? Is there no way to create cross-document references except at the top level?
Thanks
Yes, your DBRef _id references need to be to documents in your collection, not to embedded documents.
If you want to find the embedded document you'll need to do a query on info._id, and you'll need to add an index on that too (for performance), OR you'll need to store that embedded document in its own collection and treat the embedded one as a copy. Copying is OK in MongoDB ... 'one fact one place' doesn't apply here ... provided you have some way to update the copy when the main one changes (eventual consistency).
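In mongo-shell terms, the first option looks something like this (the collection name parents is a stand-in for whatever your parent collection is actually called):

// index the embedded _id so the lookup doesn't scan the whole collection
db.parents.createIndex({ "info._id" : 1 })

// fetch only the embedded document, matched by its _id
db.parents.findOne({ "info._id" : "49902cde5162504500b45c2y" }, { "info" : 1, "_id" : 0 })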
BTW, on DBRef's, the official guidance says "Most developers only use DBRefs if the collection can change from one document to the next. If your referenced collection will always be the same, the manual references outlined above are more efficient."
Also, why do you want to reference info within a document? If it were an array I could understand why you might want to refer to individual entries, but since it doesn't appear to be an array in your example, why not just refer to the containing document by its _id?
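That last suggestion, sketched in the shell (the collection names other and things are stand-ins): the referring document just stores the parent's _id as a manual reference, and the reader follows it with a second query.

// referring document stores the parent's _id as a plain (manual) reference
db.other.insertOne({ "desc" : "needs the same info", "parentId" : "49902cde5162504500b45c2c" })

// reader follows the reference with a second query and picks out "info"
var parent = db.things.findOne({ "_id" : "49902cde5162504500b45c2c" })
printjson(parent.info)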