From MongoDB to Google Cloud Datastore

I'm trying to figure out whether it would be easy to move my existing app to Google Cloud Datastore. It currently lives in MongoDB, so it would be a NoSQL-to-NoSQL move, but it does not look like it would be easy. An example:
My app uses a hierarchy of objects, which is why I like MongoDB and its elegance:
{
    "_id" : "31851460-7052-4c89-ab51-8941eb5bdc7g",
    "_status" : "PRIV",
    "_active" : true,
    "emails" : [
        { "id" : 1, "email" : "anexample@gmail.com", "_hash" : "1514e2c9e71318e5", "_objecttype" : "EmailObj" },
        { "id" : 1, "email" : "asecondexample@gmail.com", "_hash" : "78687668871318e7", "_objecttype" : "EmailObj" }
    ],
    "numbers": [
        ...
    ],
    "socialnetworks": [
        ...
    ],
    "settings": [
        ...
    ]
}
When moving to Google Cloud Datastore, I could save emails, numbers, socialnetworks, settings, etc. as strings, but that defeats the whole purpose of using JSON, as I would lose the ability to query inside them.
I have a number of tables where I have this issue.
The only solution I see is to move all of these into separate entities (tables). But in that case the number of queries will increase, and therefore so will the cost.
Another solution might be to keep only arrays of ids and do key-value-style resolves on Google Cloud Datastore, which are free (except maybe for traffic).
What is the best approach here?

The transition from Mongo to Google's Datastore would not be trivial. A big part of the reason is that Datastore (although technically still NoSQL) is a columnar database, whereas Mongo is a traditional document-oriented NoSQL store. Just as SQL databases require a different mindset from NoSQL databases, columnar databases require a different mindset again.
The transition from MongoDB to Datastore would require a comprehensive restructuring of your data.

You don't have to save everything as a string. Datastore has what is called an embedded entity, which works much like an embedded document in Mongo: a full object nested inside another.
Check out a library like Objectify, which makes it very easy to interact with Datastore.
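For illustration, here is a minimal sketch of the embedded-entity idea using the Node.js Datastore client (Objectify would be the Java equivalent); the kind name Contact and the exact field shape are just assumptions for the example, not anything prescribed by Datastore.

const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

async function saveContact() {
  // Nested objects and arrays are stored as embedded entities, so the
  // record keeps roughly the same shape it had in MongoDB.
  const key = datastore.key(['Contact', '31851460-7052-4c89-ab51-8941eb5bdc7g']);
  await datastore.save({
    key,
    data: {
      _status: 'PRIV',
      _active: true,
      emails: [
        { id: 1, email: 'anexample@gmail.com', _hash: '1514e2c9e71318e5', _objecttype: 'EmailObj' }
      ]
    }
  });
}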

Related

Mongodb Memory engine vs Redis for caching the writes

I have a server that processes users' page-viewing history with MongoDB.
The collections are updated like this when a user views a page:
view_collection
{ "_id" : "60b212afb63a57d57a8f0006",
"pageId" : "gh42RzrRqYbp2Hj1y",
"userId" : "9Swh4jkYOPjWSgxjm",
"uniqueString" : "s",
"views" : {
"date" : ISODate("2021-01-14T14:39:20.378+0000"),
"viewsCount" : NumberInt(1)
}}
page_collection
{"_id" : "gh42RzrRqYbp2Hj1y", "views" : NumberInt(30) ,"lastVisitors" : ["9Swh4jkYOPjWSgxjm"]}
user_collection
{
    "_id" : "9Swh4jkYOPjWSgxjm",
    "statistics" : {
        "totalViewsCount" : NumberInt(1197)
    }
}
Everything is working fine, except that I want to find a way to cache the write operations going to the database.
I've been thinking about using Redis to cache the writes and then periodically looping through the Redis keys to get the results inserted into the database (but that would be fairly complicated and need a lot of coding). I also found that MongoDB has an in-memory storage engine, for which I might not need to rewrite everything from scratch and could simply change some mongod config options to get the write caching working.
Redis is a much less featureful data store than MongoDB. If you don't need any of MongoDB's functionality on your data, you can put it in Redis for higher performance.
MongoDB in memory storage engine sounds like a premature optimization.
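That said, if you do try the Redis write-buffer idea from the question, a minimal sketch could look like the following; it assumes ioredis and the official MongoDB Node.js driver, and the key, database, and collection names are only illustrative.

const Redis = require('ioredis');
const { MongoClient } = require('mongodb');

const redis = new Redis();
const mongo = new MongoClient('mongodb://localhost:27017');

// On each page view: one cheap in-memory increment instead of a MongoDB write.
async function recordView(pageId) {
  await redis.hincrby('pending_views', pageId, 1);
}

// Periodically: move the accumulated counts into MongoDB in a single batch.
async function flushViews() {
  // Atomically move the hash aside so increments arriving during the flush
  // are not lost (RENAME throws if the source key does not exist yet).
  try {
    await redis.rename('pending_views', 'pending_views:flushing');
  } catch (err) {
    return; // nothing to flush
  }
  const pending = await redis.hgetall('pending_views:flushing');
  const ops = Object.entries(pending).map(([pageId, count]) => ({
    updateOne: {
      filter: { _id: pageId },
      update: { $inc: { views: parseInt(count, 10) } },
      upsert: true
    }
  }));
  if (ops.length > 0) {
    await mongo.db('app').collection('page_collection').bulkWrite(ops);
  }
  await redis.del('pending_views:flushing');
}

async function main() {
  await mongo.connect();
  setInterval(() => flushViews().catch(console.error), 60 * 1000);
}
main().catch(console.error);

The usual write-behind trade-off applies: any counts still buffered in Redis are lost if the process or Redis crashes before a flush.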

MongoDB Aggregation query running very slow

We version most of our collections in MongoDB. The selected versioning mechanism is as follows:
{ "docId" : 174, "v" : 1, "attr1": 165 } /*version 1 */
{ "docId" : 174, "v" : 2, "attr1": 165, "attr2": "A-1" }
{ "docId" : 174, "v" : 3, "attr1": 184, "attr2" : "A-1" }
So, when we perform our queries, we always need to use the aggregation framework in this way to ensure we get the latest versions of our objects:
db.docs.aggregate( [
    {"$sort":{"docId":-1,"v":-1}},
    {"$group":{"_id":"$docId","doc":{"$first":"$$ROOT"}}},
    {"$match":{<query>}}
] );
The problem with this approach is once you have done your grouping, you have a set of data in memory which has nothing to do with your collection and thus, your indexes cannot be used.
As a result, the more documents your collection has, the slower the query gets.
Is there any way to speed this up?
If not, I will consider moving to one of the approaches described in this good post: http://www.askasya.com/post/trackversions/
Just to complete this question: we went with option 3, one collection to keep the latest versions and one collection to keep the historical ones. It is introduced here: http://www.askasya.com/post/trackversions/ and some further description (with some nice code snippets) can be found in http://www.askasya.com/post/revisitversions/.
It has been running in production now for 6 months. So far so good. The former approach meant we were always using the aggregation framework, which moves away from indexes as soon as you reshape the documents (using $group, $project, ...), since the pipeline output no longer matches the original collection. This was making our performance terrible as the data grew.
With the new approach the problem is gone. 90% of our queries go against the latest data, which means we target a collection with a simple ObjectId as the identifier and no longer require the aggregation framework, just regular finds.
Our queries against historical data always include id and version, so by indexing these (we include both in the _id, so we get the index out of the box), reads against those collections are equally fast. This is a point not to overlook, though: the read patterns in your application are crucial when designing how your collections/schemas should look in MongoDB, so you must make sure you know them when taking such decisions.
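For reference, a rough mongo-shell sketch of that write path; the collection names docs_latest and docs_history are assumptions for the example, not the actual names we use.

// Write path: archive the outgoing version, then replace the latest one.
function saveNewVersion(docId, newFields) {
    var current = db.docs_latest.findOne({ _id: docId });
    var nextVersion = current ? current.v + 1 : 1;

    if (current) {
        // The compound _id gives an index on (docId, v) "out of the box"
        // for reads against historical data.
        current._id = { docId: docId, v: current.v };
        db.docs_history.insertOne(current);
    }

    var latest = Object.assign({ _id: docId, v: nextVersion }, newFields);
    db.docs_latest.replaceOne({ _id: docId }, latest, { upsert: true });
}

// Read paths become plain finds, no aggregation pipeline needed:
// db.docs_latest.find({ attr1: 184 })
// db.docs_history.find({ _id: { docId: 174, v: 2 } })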

Performing 'find all documents that ...' queries

I'm coming from a Postgres background, and I am currently contemplating whether I should use a NoSQL database such as MongoDB in my next project. For this I have a few questions:
Is it possible to perform queries in NoSQL that fetch all the documents sharing some common subdocument/attribute, for example "select all users where country = italy"?
Also, how is redundancy handled in NoSQL? Say I have a document that represents a given car model that multiple people can own. Would I then have to insert the exact same data, describing the given car model, into all of these People documents?
Thanks
Sure, you can do queries with a where clause in MongoDB (and other NoSQL engines). Taking your example, you would store the users in a "collection" named "users" and query it more or less the same way:
db.users.find( { "country" : "Italy" } );
MongoDB has a very rich and powerful query and aggregation engine ( http://docs.mongodb.org/manual/tutorial/query-documents/ ). I invite you to follow the tutorials ( http://docs.mongodb.org/manual/tutorial/getting-started/ ) or the free online training ( http://university.mongodb.com ) to learn more about it.
Inserting a document is also really easy:
db.users.insert( {"first_name" : "John", "last_name" : "Doe", "country" : "USA" } );
that's it!
You talk about redundancy: as in your SQL world, it depends a lot on the design. In MongoDB you will organize your documents and the links between them (referenced or embedded documents) based on your business needs. It is hard to give an answer about document design in this context, so I will invite you to read some interesting articles:
MongoDB Documentation : http://docs.mongodb.org/manual/data-modeling/
MongoDB Blog, 6 Rules of Thumb for MongoDB Schema Design (part 1,2,3)
http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
http://blog.mongodb.org/post/87892923503/6-rules-of-thumb-for-mongodb-schema-design-part-2
http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3
The answer to the question about users and cars is "it depends on your application".
If your application is mostly reads and you usually need most of the data about cars & users together, duplication (denormalization) is a good approach that keeps development easy (and yes, it means more work when you have to update the information...). The blog posts and documentation should help you find your way.
Yes, you can do queries in MongoDB on documents that have common subdocument/attribute.
MongoDB encourages embedding (denormalization) of data, as disk space is cheap and embedding documents can result in better query performance. Yes, embedding documents would mean inserting the same car-model data into all those People documents. If you want to avoid duplication/denormalization, you can go for 'referencing', which is a normalized model that stores data something like this: the People collection contains people docs and the Car collection contains car docs, similar to what we have in an RDBMS. But the primary/foreign key relationship is not enforced by MongoDB; you end up doing joins in code, and hence query performance degrades.
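To make the two options concrete, a small mongo-shell sketch; the collection and field names are made up for illustration.

// Embedded (denormalized): the car's details are copied into each owner.
db.people.insertOne({
    name: "Alice",
    car: { make: "Tesla", model: "Model S", year: 2014 }
});

// Referenced (normalized): each person stores only the car's _id, and the
// application performs the "join" itself with a second query.
var carId = db.cars.insertOne({ make: "Tesla", model: "Model S", year: 2014 }).insertedId;
db.people.insertOne({ name: "Bob", car_id: carId });

var bob = db.people.findOne({ name: "Bob" });
db.cars.findOne({ _id: bob.car_id });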

Design pattern for directed acyclic graphs in MongoDB

The problem
As usual, the problem is to represent a directed acyclic graph in a database. The database choices I had were a relational database like MySQL, or MongoDB. I chose MongoDB because DAGs in relational databases are a mess, but if there is a trick I just didn't find, please tell me.
The goal is to map a DAG onto one or multiple MongoDB documents. Because nodes can have multiple children and multiple parents, plain subdocuments were not an option. I came across multiple design patterns but am not sure which one is the best to go with.
Tree-structure with Ancestors Array
The ancestors array is suggested by the MongoDB docs and is quite easy to understand. As I understand it, my documents would look like this:
{
    "_id" : "root",
    "ancestors" : [ null ],
    "left" : 1
}
{
    "_id" : "child1",
    "ancestors" : [ "root" ],
    "left" : 2
}
{
    "_id" : "child2",
    "ancestors" : [ "root", "child1" ],
    "left" : 1
}
This allows me to find all children of an element like this:
db.Tree.find({ancestors: 'root'}).sort({left: -1})
and all parents like this:
db.Tree.findOne({_id: 'child1'}).ancestors
DBRefs instead of Strings
My second approach would be to replace the string keys with DBRefs. But apart from longer database records, I don't see many advantages over the ancestors array.
String-based array with children and parents
The last idea is to store not only the children of each document but its parents as well. This would give me all the features I want. The downside is the massive overhead of information I would create by storing every relation twice. Furthermore, I am worried about the amount of bookkeeping involved: e.g. if a document gets deleted, I have to check all other documents for references to it in multiple fields.
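For illustration, a node under this third pattern might look something like this (the field names are just my assumption):
{
"_id" : "child1",
"parents" : [ "root" ],
"children" : [ "child2", "child3" ]
}
// Children and parents are read directly off the document, but deleting a
// node means pulling its id out of both arrays everywhere, e.g.:
// db.Tree.updateMany({}, { $pull: { parents: "child1", children: "child1" } })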
My Questions
Is MongoDB the right choice over a relational database for this purpose?
Are there any up-/downsides to any of my patterns that I missed?
Which pattern would you suggest, and why? Do you have experience with any of them?
Why don't you use a graph database? Check out ArangoDB: you can use documents like in MongoDB, and graphs as well. MongoDB is a great database, but not for storing graph-oriented data; ArangoDB handles that.
https://www.arangodb.com/

How to mapreduce on key from another collection

Say I have a collection of users like this:-
{
"_id" : "1234",
"Name" : "John",
"OS" : "5.1",
"Groups" : [{
"_id" : "A",
"Name" : "Group A"
}, {
"_id" : "C",
"Name" : "Group C"
}]
}
And I have a collection of events like this:-
{
"_id" : "15342",
"Event" : "VIEW",
"UserId" : "1234"
}
I'm able to use mapreduce to work out the count of events per user, as I can just emit the "UserId" and count off of that. However, what I want to do now is count events by group.
If I had a "Groups" array in my event document then this would be easy, but I don't, and this is only an example; the actual application is much more complicated and I don't want to replicate all that data into the event document.
I've seen an example at http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ but I can't see how it applies in this situation, as it is aggregating values from two places... all I really want to do is perform a lookup.
In SQL I would simply JOIN my flattened UserGroup table to the event table and just GROUP BY UserGroup.GroupName
I'd be happy with multiple passes of mapreduce... a first pass to count by UserId into something like { "_id" : "1234", "count" : 9 }, but I get stuck on the next pass... how do I include the group id?
Some potential approaches I've considered:-
Include group info in the event document (not feasible)
Work out how to "join" the user collection or look-up the users groups from within the map function so I can emit the group id's as well (don't know how to do this)
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
What is possible and what are the benefits/issues with each approach?
Your third approach is the way to go:
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
To do this you'll need to create a new collection J with the "joined" data you need for map-reduce. There are several strategies you can use for this:
Update your application to insert/update J in the normal course of business. This is best in the case where you need to run MR very frequently and with up-to-date data. It can add substantially to code complexity. From an implementation standpoint, you can do this either directly (by writing to J) or indirectly (by writing changes to a log collection L and then applying the "new" changes to J). If you choose the log collection approach you'll need a strategy for determining what's changed. There are two common ones: high-watermark (based on _id or a timestamp) and using the log collection as a queue with the findAndModify command.
Create/update J in batch mode. This is the way to go in the case of high-performance systems where the multiple updates from the above strategy would affect performance. This is also the way to go if you do not need to run the MR very frequently and/or you do not have to guarantee up-to-the-second data accuracy.
If you go with (2) you will have to iterate over documents in the collections you need to join--as you've figured out, Mongo map-reduce won't help you here. There are many possible ways to do this:
If you don't have many documents and if they are small, you can iterate outside of the DB with a direct connection to the DB.
If you cannot do (1) you can iterate inside the DB using db.eval(). If the number of documents is not small, make sure to use nolock: true, as db.eval is blocking by default. This is typically the strategy I choose, as I tend to deal with very large document sets and I cannot afford to move them over the network.
If you cannot do (1) and do not want to do (2) you can clone the collections to another node with a temporary DB. Mongo has a convenient cloneCollection command for this. Note that this does not work if the DB requires authentication (don't ask why; it's a strange 10gen design choice). In that case you can use mongodump and mongorestore. Once you have the data local to a new DB you can party on it as you see fit. Once you complete the MR you can update the result collection in your production DB. I use this strategy for one-off map-reduce operations with heavy pre-processing so as to not load the production replica sets.
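As a concrete illustration of the batch approach, here is a rough mongo-shell sketch that builds the joined collection and then runs map-reduce over it, grouping by group id; the collection names users, events, and events_joined are assumptions based on the question.

// Batch "join": copy each event together with its user's group ids.
db.events_joined.drop();
db.events.find().forEach(function (ev) {
    var user = db.users.findOne({ _id: ev.UserId });
    var groupIds = (user && user.Groups) ? user.Groups.map(function (g) { return g._id; }) : [];
    db.events_joined.insertOne({
        _id: ev._id,
        Event: ev.Event,
        UserId: ev.UserId,
        GroupIds: groupIds
    });
});

// Map-reduce over the joined collection: count events per group.
db.events_joined.mapReduce(
    function () {
        this.GroupIds.forEach(function (gid) { emit(gid, 1); });
    },
    function (key, values) { return Array.sum(values); },
    { out: "events_per_group" }
);

// db.events_per_group.find()  // => { "_id" : <groupId>, "value" : <count> }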
Good luck!