MongoDB in-memory engine vs Redis for caching writes

I have a server for processing users' page-viewing history with MongoDB.
The documents are saved like this when a user views a page:
view_collection
{
  "_id" : "60b212afb63a57d57a8f0006",
  "pageId" : "gh42RzrRqYbp2Hj1y",
  "userId" : "9Swh4jkYOPjWSgxjm",
  "uniqueString" : "s",
  "views" : {
    "date" : ISODate("2021-01-14T14:39:20.378+0000"),
    "viewsCount" : NumberInt(1)
  }
}
page_collection
{
  "_id" : "gh42RzrRqYbp2Hj1y",
  "views" : NumberInt(30),
  "lastVisitors" : ["9Swh4jkYOPjWSgxjm"]
}
user_collection
{
  "_id" : "9Swh4jkYOPjWSgxjm",
  "statistics" : {
    "totalViewsCount" : NumberInt(1197)
  }
}
Everything is working fine, except that I want to find a way to cache the write operations going to the database.
I've been thinking about using Redis to cache the writes and then periodically looping through the Redis keys to insert the results into the database (but that would be complicated and need a lot of coding). I also found that MongoDB has an In-Memory Storage Engine, for which I might not need to rewrite everything from scratch and could simply change some mongod configuration to get write caching working.
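For illustration, the "buffer in Redis, flush periodically" idea might look roughly like the sketch below. It assumes the node-redis v4 client and the official MongoDB Node.js driver; the Redis key, database name and flush interval are made up, and only the page-view counter is handled.

const { createClient } = require('redis');
const { MongoClient } = require('mongodb');

async function main() {
  const redis = createClient();
  await redis.connect();
  const mongo = await MongoClient.connect('mongodb://localhost:27017');
  const pages = mongo.db('app').collection('page_collection');

  // On each page view, only touch Redis (a cheap in-memory increment).
  async function recordView(pageId) {
    await redis.hIncrBy('pending_page_views', pageId, 1);
  }

  // Periodically drain the buffered counts into MongoDB in one bulk write.
  async function flush() {
    const counts = await redis.hGetAll('pending_page_views');
    const ops = Object.entries(counts).map(([pageId, n]) => ({
      updateOne: {
        filter: { _id: pageId },
        update: { $inc: { views: parseInt(n, 10) } },
        upsert: true,
      },
    }));
    if (ops.length > 0) {
      await pages.bulkWrite(ops);
      await redis.del('pending_page_views');
    }
  }

  setInterval(() => flush().catch(console.error), 60 * 1000);
}

main().catch(console.error);

Note that a flush like this is not transactional: a crash between bulkWrite and del could double-count, which is part of the extra complexity the question mentions.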

Redis is a much less featureful data store than MongoDB. If you don't need any of the MongoDB functionality on your data, you can put it in Redis for higher performance.
The MongoDB in-memory storage engine sounds like a premature optimization.

Related

How to get notified when a new field is added to mongodb collections?

I have a GraphQL schema defined which needs to be changed at runtime whenever a new field is added to a MongoDB collection. For example, a document has just two fields before:
person {
  "age" : "54",
  "name" : "Tony"
}
And later a new field, "height", is added:
person {
  "age" : "54",
  "name" : "Tony",
  "height" : "167"
}
I need to change my GraphQL schema and add height to it. So how do I get alerted or notified by MongoDB?
MongoDB does not natively implement event messaging: you cannot, out of the box, get informed of database, collection or document updates.
However, MongoDB has an 'operation log' (oplog) feature, which gives you access to a journal of each write operation on collections.
The oplog is what MongoDB replication, i.e. the cluster synchronization feature, is built on. In order to have an oplog you need at least two MongoDB instances, a primary and a replica.
The operation log is built on the capped collection feature, which gives a collection an append-only mechanism that ensures fast writes and tailable cursors. The authors say:
The oplog exists internally as a capped collection, so you cannot
modify its size in the course of normal operations.
MongoDB - Change the Size of the Oplog
And:
Capped collections are fixed-size collections that support
high-throughput operations that insert and retrieve documents based on
insertion order. Capped collections work in a way similar to circular
buffers: once a collection fills its allocated space, it makes room
for new documents by overwriting the oldest documents in the
collection.
MongoDB - Capped Collections
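For illustration only, this is roughly how a capped collection is created in the mongo shell; the collection name and size limits here are arbitrary:

db.createCollection("events", { capped: true, size: 1048576, max: 10000 })

Once either the size or the document limit is reached, the oldest documents are overwritten.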
The schema of the documents within an operation log journal looks like:
"ts" : Timestamp(1395663575, 1),
"h" : NumberLong("-5872498803080442915"),
"v" : 2,
"op" : "i",
"ns" : "wiktory.items",
"o" : {
"_id" : ObjectId("533022d70d7e2c31d4490d22"),
"author" : "JRR Hartley",
"title" : "Flyfishing"
}
}
E.g. "op" : "i" means the operation is an insert and "o" is the inserted document.
The same way, you can be informed of update operations:
"op" : "u",
"ns" : "wiktory.items",
"o2" : {
"_id" : ObjectId("533022d70d7e2c31d4490d22")
},
"o" : {
"$set" : {
"outofprint" : true
}
}
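A rough sketch of tailing the oplog from a Node.js client, assuming a replica set and a direct connection to a member (not through mongos); the namespace filter reuses the example above:

const { MongoClient } = require('mongodb');

async function tailOplog() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const oplog = client.db('local').collection('oplog.rs');

  // A tailable, awaitData cursor keeps waiting for new entries instead of
  // closing when it reaches the end of the capped collection.
  const cursor = oplog.find(
    { ns: 'wiktory.items' },
    { tailable: true, awaitData: true }
  );
  for await (const entry of cursor) {
    console.log(entry.op, entry.o);
  }
}

tailOplog().catch(console.error);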
Note that the operation logs (you access them as collections) are limited either in disk size or in number of entries (FIFO). This means that, whenever your oplog consumers are slower than the oplog writers, you will eventually miss operation log entries, resulting in corrupted consumption results.
This is the reason why MongoDB is terrible for guaranteeing document tracking on heavily solicited clusters, and the reason why messaging solutions such as Apache Kafka come in as supplements for event tracking (e.g. document update events).
To answer your question: in a reasonably solicited environment, you might want to take a look at the JavaScript Meteor project, which allows you to trigger events based on changes in query results and relies on the MongoDB oplog feature.
Credits: oplogs examples from The MongoDB Oplog
As of MongoDB 3.6 you can subscribe to a change stream. You could subscribe to "update" operation events. More details here:
https://stackoverflow.com/a/47184757/5103354
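A minimal sketch of what that looks like with the Node.js driver (change streams require a replica set; the database and collection names here are illustrative):

const { MongoClient } = require('mongodb');

async function watchPerson() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const coll = client.db('test').collection('person');

  // Only react to updates; updateDescription.updatedFields lists the fields
  // that changed, which is enough to spot a newly added field like "height".
  const stream = coll.watch([{ $match: { operationType: 'update' } }]);
  stream.on('change', (change) => {
    console.log('updated fields:', change.updateDescription.updatedFields);
  });
}

watchPerson().catch(console.error);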

From MongoDB to Google Cloud Datastore

I'm figuring out whether it would be easy to move my existing app to Google Cloud Datastore. Currently it lives in MongoDB, so it would be from NoSQL to NoSQL, but it does not seem that it would be easy. An example:
My app uses a hierarchy of objects, which is why I like MongoDB for its elegance:
{
  "_id" : "31851460-7052-4c89-ab51-8941eb5bdc7g",
  "_status" : "PRIV",
  "_active" : true,
  "emails" : [
    { "id" : 1, "email" : "anexample@gmail.com", "_hash" : "1514e2c9e71318e5", "_objecttype" : "EmailObj" },
    { "id" : 1, "email" : "asecondexample@gmail.com", "_hash" : "78687668871318e7", "_objecttype" : "EmailObj" }
  ],
  "numbers": [
    ...
  ],
  "socialnetworks": [
    ...
  ],
  "settings": [
    ...
  ]
}
While moving to Google Cloud Datastore, I could save emails, numbers, socialnetworks, settings, etc. as strings, but that defeats the whole purpose of using JSON, as I would lose the ability to query them.
I have a number of tables where I have this issue.
The only solution I see is to move all of these into different entities (tables). But in that case the number of queries will increase, and therefore also the cost.
Another solution might be to keep only arrays of IDs and do key-value-like lookups on Google Cloud Datastore, which are free (except maybe for traffic).
What is the best approach here?
The transition from Mongo to Google's Datastore would not be trivial. A big part of the reason is that Datastore (although technically still NoSQL) is a columnar database, whereas Mongo is a traditional document store. Just as SQL databases require a different mindset from NoSQL databases, columnar databases require a different mindset again.
The transition from MongoDB to Datastore would require a comprehensive restructuring of your data.
You don't have to save everything as a string. Datastore has what is called an embedded entity, which looks very much like in Mongo: just a full object nested inside another.
Check out a library like Objectify, which makes it very easy to interact with Datastore.
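As a rough illustration of the embedded-entity approach, here is a sketch using the Node.js @google-cloud/datastore client rather than Objectify; the kind and property names are simply taken from the example above:

const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

async function savePerson() {
  const key = datastore.key(['Person', '31851460-7052-4c89-ab51-8941eb5bdc7g']);
  // Plain nested objects are stored as embedded entities, so the nested
  // structure of the Mongo document survives; indexed nested properties can
  // typically be filtered with dot notation (e.g. 'emails.email').
  await datastore.save({
    key,
    data: {
      _status: 'PRIV',
      _active: true,
      emails: [
        { id: 1, email: 'anexample@gmail.com', _objecttype: 'EmailObj' },
        { id: 1, email: 'asecondexample@gmail.com', _objecttype: 'EmailObj' },
      ],
    },
  });
}

savePerson().catch(console.error);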

mongoDB sharded cluster and chunks

I'm quite a newbie with MongoDB and I'm wondering if it is normal that I don't have any chunks on a MongoDB sharded cluster.
Let me illustrate. I've got three shards:
mongos> use config
mongos> db.getSiblingDB("config").shards.find()
{ "_id" : "shard1", ... }
{ "_id" : "shard2", ... }
{ "_id" : "shard3", ... }
mongos>
I've got some databases, and especially one on shard1:
mongos> db.getSiblingDB("config").databases.find()
{ "_id" : "udev_prod", "partitioned" : false, "primary" : "shard1" }
But no chunks at all... :
mongos> db.getSiblingDB("config").chunks.find()
mongos>
On top of that, if I connect to the udev_prod database and try to get the shard distribution of any collection, MongoDB tells me it's not sharded:
mongos> db.User.getShardDistribution()
Collection udev_prod.User is not sharded.
I think I'm missing something here, or it is not working well. Could someone tell me if this situation is "normal"?
Thanks a lot
Best Regards
Julien
This is the key piece from your find on databases:
"partitioned" : false
That means that the database does not have sharding enabled. You need to enable sharding for the database first, and then shard a collection (and pick a shard key) before any chunks are created. Otherwise the database just lives on one shard - it's still usable, just not sharded.
There is a full tutorial available for setting up a sharded cluster with sharded collections; that is the section you want to start with.
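For reference, enabling sharding for the database and then sharding a collection looks roughly like this from mongos; the shard key here (a hashed userId) is only an example and has to be chosen to suit your data:

sh.enableSharding("udev_prod")
db.getSiblingDB("udev_prod").User.createIndex({ userId: "hashed" })
sh.shardCollection("udev_prod.User", { userId: "hashed" })

Once the collection is sharded, chunks will start to appear in config.chunks and getShardDistribution() will report per-shard numbers.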

Mongodb query optimization - running query in parallel

I am trying to run some wildcard/regex-based queries on a Mongo cluster from the Java driver.
Mongo replica set config:
3-member replica set
16 CPUs (hyperthreaded), 24 GB RAM, Linux x86_64
Collection size: 6M documents, 7 GB of data
Client is localhost (Mac OS X 10.8) with the latest mongo-java driver
Query using the Java driver with readPreference = primaryPreferred
{ "$and" : [{ "$or" : [ { "country" : "united states"}]} , { "$or" : [ { "registering_organization" : { "$regex" : "^.*itt.*hartford.*$"}} , { "registering_organization" : { "$regex" : "^.*met.*life.*$"}} , { "registering_organization" : { "$regex" : "^.*cardinal.*health.*$"}}]}]}
I have a regular index on both "country" and "registering_organization". But as per the Mongo docs a single query can utilize only one index, and I can see that from explain() on the above query as well.
So my question is: what is the best alternative to achieve better performance on the above query?
Should I break up the 'and' operations and do the intersection in memory? Going further, I will have 'not' operations in the query too.
I think my application may turn into reporting/analytics in the future, but that is not imminent and I am not looking to design for it yet.
There are so many things wrong with this query.
Your nested conditionals with regexes will never be fast in MongoDB. MongoDB is not the best tool for "data discovery" (e.g. ad-hoc, multi-conditional queries for uncovering unknown information). MongoDB is blazing fast when you know the metrics you are generating, but not for data discovery.
If this is a common query you are running, then I would create an attribute called "united_states_or_health_care", and set the value to the timestamp of the create date. With this method, you are moving your logic from your query to your document schema. This is one common way to think about scaling with MongoDB.
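A hedged sketch of that idea in the mongo shell; the collection name is made up, and the matching rule (when to set the flag) is whatever your application decides at write time:

// At write time, set the flag only when the document matches the criteria.
db.records.insertOne({
  country: "united states",
  registering_organization: "ITT Hartford",
  united_states_or_health_care: new Date()
})
db.records.createIndex({ united_states_or_health_care: 1 })

// Reads become a simple indexed range query instead of regexes, e.g. all
// matching documents created since a given date:
db.records.find({ united_states_or_health_care: { $gte: ISODate("2021-01-01") } })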
If you are doing data discovery, you have a few different options:
Have your application concatenate the results of the different queries
Run the query on a secondary MongoDB member, and accept slower performance
Pipe your data to PostgreSQL using mosql. Postgres will run these data-discovery queries much faster.
Another tip:
Your regexes are not anchored in a way that lets them be fast. It would be best to run your "registering_organization" attribute through a "findable_registering_organization" filter. The filter would break the organization name apart into an array of queryable name subsets, and you would quit using the regexes. +2 points if you can filter incoming names by an industry lookup.
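A minimal sketch of that tokenised-name idea in the mongo shell; the collection name and the exact tokenisation are illustrative:

// Store a lowercased array of queryable name fragments next to the original.
db.records.insertOne({
  country: "united states",
  registering_organization: "ITT Hartford",
  findable_registering_organization: ["itt", "hartford", "itt hartford"]
})
db.records.createIndex({ country: 1, findable_registering_organization: 1 })

// The unanchored regexes become an indexed $in query.
db.records.find({
  country: "united states",
  findable_registering_organization: { $in: ["itt hartford", "met life", "cardinal health"] }
})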

Why do reads in MongoDB sometimes wait for lock?

While using db.currentOp(), I sometimes see operations like:
{
  "opid" : 1238664,
  "active" : false,
  "lockType" : "read",
  "waitingForLock" : true,
  "op" : "query",
  .....
  "desc" : "conn"
}
Why does a read operation need to wait for a lock? Is there a way to tell a query to ignore any pending writes and just go ahead and read anyway?
You can't tell a query to ignore pending writes, because MongoDB indexes work synchronously, and this is by design.
For example, indexes in RavenDB can work in either async or sync mode. So maybe you need RavenDB (if you're on Windows) ;)
Why do reads in MongoDB sometimes wait for lock?
They are waiting for an index rebuild.
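Incidentally, db.currentOp() accepts a filter document, so you can narrow the output to just the operations that are blocked, e.g.:

db.currentOp({ waitingForLock: true, op: "query" })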