How to reduce the number of documents to be synced from a MongoDB - mongodb

In my current project, I am using two databases.
A MongoDB instance gathering data from different data providers (about 15M documents)
Another (relational) database instance holding only the data which is needed for the application, i.e. a subset of the data in the MongoDB instance (about 5M rows)
As part of the synchronisation process, I need to regularly check for new entries in the MongoDB depending on data in the relational DB.
Let's say, this is about songs and artists, a document in the MongoDB might look like this:
{ _id: 1, artists: ["Simon", "Garfunkel"], name: "El Condor Pasa" }
Part of the sync process is to import/update all songs from those artists that already exist in the relational DB, which are currently about 1M artists.
So how do I retrieve all songs of 1M named artists from MongoDB for import?
My first thought (and try) was to iterate over all artists and query all songs for each artist (of course, there is an index on the "artists" field). But this takes several minutes for each batch of 1,000 artists, which would make this process a long runner.
My second thought was to write all existing artists to a separate MongoDB collection and have a super query which only retrieves songs of artists that are stored in there. But so far I have not been able to retrieve data based on two collections.
Is this a good use case for map/reduce? If yes, can someone please give me a hint on how to achieve this? (I am not completely new to NoSQL, but sort of a newbie when it comes to map/reduce.)
Or is this idea just crazy and I have to stick with a process that's running for several days?
Thanks in advance for any hints.

If you regularly need to check for changes, then add a timestamp to your data, and incorporate that timestamp into your query. For example, if you add a "created_ts" attribute, then you can look for records that were created since the last time your batch ran.
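As a rough sketch of that idea (the songs collection and the created_ts field are assumed names, not taken from the question):

// mongosh sketch of an incremental sync based on a creation timestamp.
// "songs" and "created_ts" are assumed names used for illustration only.
const lastRunAt = ISODate("2024-01-01T00:00:00Z"); // persisted from the previous batch run

const newDocs = db.songs.find(
  { created_ts: { $gt: lastRunAt } },  // only documents created since the last run
  { _id: 1, artists: 1, name: 1 }      // project only the fields the sync needs
);

// An index on created_ts keeps this incremental check cheap:
db.songs.createIndex({ created_ts: 1 });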
Here are a few ideas for making the mongo interaction more efficient:
Reduce network overhead by using an "in" query (see the sketch after this list). Play around with the size of the array of artist IDs in order to determine what works best for your case.
Reduce network overhead by only selecting or reading the attributes that you need.
Make sure that your documents are indexed by artist.
On the Mongo server, make sure that as much of your data fits into memory as possible. Retrieving data from disk is going to be slow no matter what else you do. If it doesn't fit into memory, then you have a few options -- buy more memory; shrink your data set (ex. drop attributes that you don't actually need); shard; etc.
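A minimal sketch combining these points, assuming the artist names from the relational DB are handed over in batches (the songs collection name and the batch size are placeholders to tune):

// mongosh sketch: fetch all songs for one batch of artist names with $in,
// projecting only the fields needed for the import.
const artistBatch = ["Simon", "Garfunkel" /* ... up to ~1,000 names per batch */];

const cursor = db.songs.find(
  { artists: { $in: artistBatch } },  // one round trip per batch instead of one per artist
  { _id: 1, artists: 1, name: 1 }     // reduce the network payload
);

// The index on "artists" mentioned in the question is what makes the $in efficient:
db.songs.createIndex({ artists: 1 });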

Related

Small collection with document average size of 30kb - mongodb query very slow

I am building a chat service and I designed the chat schema so that it nests all the users that belong to the chat. I only nested the essential data of users, such as name, avatarUrl, and userId.
I believe compared to relational databases, this nesting feature is the power of MongoDB, where you basically store the "JOIN"ed data. So querying a nested document should generally be faster than querying a chat row and joining to multiple user rows.
Now even though I only store essential data, some of the chat documents became quite large (30kB), because there were chat rooms with more than 100 users. There will be a max user limit, so the chat document will not grow indefinitely. But nesting about 100 users, leading to a document size of about 30kB, looks reasonable to me.
But then I realized that loading the chat page became significantly slower for chats with many users. I measured the time taken for the query to execute from the backend (Node.js in a local environment, so there is some latency: my laptop is in Korea and the database server is in the US).
For small documents the query time was within 230ms, but for the 30kB document a simple Chat.findOne({ _id: chatId }) took about 500ms, and that's just way too long. The collection only has about 100 documents, so an index would not improve performance.
Now two things come in my mind.
First, why is the document so big? Maybe it's best practice to remove keys and store everything in array (matrix) format? That would be terrible to work with in the backend... But maybe it is necessary for performance optimization later. Is this common practice?
Second, my real question: why does it take so long to load 20kB of data? If 20kB already takes 500ms, I am guessing that larger documents would be pretty much unusable.
I am a huge fan of MongoDB and its nesting style. But if documents need to stay small for reasonable response times, then this is really upsetting, because it means I can only use nesting sparingly, and I would have to use $lookup or Mongoose populate to make joins, which are terrible in terms of performance.
I am currently using MongoDB 4.4 with MongoDB Atlas M10. The current app has a very small user pool, which is why I thought M10 is sufficient.
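One way to check whether the 500ms is spent on the server or on the wire is to compare the server-side execution time with the end-to-end time measured in Node.js (a measurement sketch only; "chats" is an assumed collection name, chatId is a placeholder, and the users field path is assumed):

// mongosh: server-side execution time for the lookup. If this is only a few
// milliseconds, most of the 500ms is network transfer of the ~30kB document.
const stats = db.chats.find({ _id: chatId }).explain("executionStats").executionStats;
print(stats.executionTimeMillis);

// On the Mongoose side, projecting only the fields the page needs shrinks the payload:
// Chat.findOne({ _id: chatId }).select("name users.name users.avatarUrl").lean();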

What is the best way to archive history data in mongo

I have a collection in Mongo that stores every user action of my application, and it is very large (about 3 million documents per day). On the UI I have a requirement to show the user actions for a period of at most 6 months.
The queries on this collection are becoming very slow with all the historic data, even though there are indexes in place. So I want to move the documents that are older than 6 months to a separate collection.
Is it the right way to handle my issue?
Following are some of the techniques you can use to manage data growth in MongoDB:
Using capped collections
Using TTL indexes (see the sketch after this list)
Using multiple collections, one per month
Using different databases on the same host
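A minimal sketch of the first two options, assuming the user-action documents carry a createdAt date field (the collection and field names are placeholders):

// mongosh: a TTL index lets MongoDB expire user-action documents automatically
// once they are older than roughly 6 months (expressed in seconds).
db.userActions.createIndex(
  { createdAt: 1 },
  { expireAfterSeconds: 60 * 60 * 24 * 183 }
);

// Alternatively, a capped collection keeps recent actions at a fixed size on disk,
// overwriting the oldest documents as new ones arrive:
db.createCollection("userActionsRecent", { capped: true, size: 10 * 1024 * 1024 * 1024 });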

Querying a large mongodb collection in real-time

We have a service that allow people to open a room and play YouTube songs while others are listening in real-time.
Among other collections in our MongoDB we have one to store the songs users add to a room's playlist; it is called userSong.
This collection holds records for all songs added for the combination of: user-room-song.
The code makes frequent queries to the collection in those major operations:
Loading current playlist (regular find with a trivial condition)
Loading a random song for a room (using the Mongo aggregation framework)
Loading a room's top songs (using the Mongo aggregation framework; a rough sketch of such a pipeline follows this list)
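As an illustration of the kind of pipeline meant here (the roomId and songId field names are assumptions about the userSong schema, and the compound index is only a suggestion):

// mongosh sketch of a "top songs for a room" pipeline over the userSong collection.
db.userSong.aggregate([
  { $match: { roomId: someRoomId } },                 // narrow to one room first
  { $group: { _id: "$songId", adds: { $sum: 1 } } },  // count how often each song was added
  { $sort: { adds: -1 } },
  { $limit: 10 }
]);

// A compound index lets the $match stage avoid scanning the whole collection:
db.userSong.createIndex({ roomId: 1, songId: 1 });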
Now this collection has become big (over 1M records) and things have started to slow down. AWS has been sending us CPU utilization notifications more often, and according to mongotop the userSong collection is responsible for most of the CPU consumption, mostly in READ operations.
We made some modifications to the collection indexes and it helped a bit, but it's still not a solution; we need to find some other way to arrange the data because it is growing exponentially.
We thought about splitting the userSong data into a lower-level segmentation: instead of user-room-song, a collection of user-song records for each room in the system. This would shorten the time needed to fetch data from the DB. Now we need to decide how to do that:
Make a new collection for each room (roomUserSong) that holds all user-song records for a particular room. This might be good for quick fetching, but it will create an unlimited number of new collections in the database (roomusersong-1, roomusersong-2, ..., roomusersong-n), and we don't know whether this is good practice or whether there are other Mongo limitations with that kind of solution.
Create just one more collection in the DB with the following shape:
{ room: <roomId>, userSongs: [userSong1, userSong2, ..., userSongN] }, so each room has its own document and inside it a sub-document (an array) that holds all user-song records for this room. This solves the previous issue (creating unlimited collections), but it will be hard to work with in Mongoose (our ODM) later because (as far as I know) we cannot define a schema in advance for such a data structure. It may also run into the document size limitation, which is 16MB as far as I understood.
It would be nice to hear some advice from people who have Mongo experience with these kinds of situations:
Is 1M+ records really considered big, and should it be causing these CPU utilization issues? (We are using an AWS m3.medium, one core.)
Which of the approaches introduced above is the better solution?
Any other ideas for adding smart caching without changing the code too much?
Thanks to all helpers!

Is it possible to run queries on 200GB data on mongodb with 16GB RAM?

I am trying to run a simple query to find number of all records with a particular value using:
db.ColName.find({id_c:1201}).count()
I have 200GB of data. When I run this query, mongodb takes up all the RAM and my system starts lagging. After an hour of futile waiting, I give up without getting any results.
What can be the issue here and how can I solve it?
I believe the right approach in the NoSQL world isn't to perform a full query like that, but to accumulate stats over time.
For example, you could have a stats collection of arbitrary objects, each with a kind or id property that can take a value like "totalUserCount". Whenever you add a user, you also update this count.
This way you get instant results: it's just reading a property value from a small stats collection.
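A minimal sketch of that counter pattern (the stats collection name and the "totalUserCount" key follow the answer; the exact document shape is an assumption):

// mongosh: keep a pre-aggregated counter up to date on every insert.
// Reading the count later is a single-document lookup instead of a 200GB scan.
db.stats.updateOne(
  { _id: "totalUserCount" },
  { $inc: { count: 1 } },
  { upsert: true }          // create the counter document on first use
);

// Later, reading the stat is instant:
db.stats.findOne({ _id: "totalUserCount" });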
By the way, this slowness is most likely caused by querying on a property that is not indexed in your collection. Try indexing id_c and you will probably get quicker results.
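For the original query, that would look roughly like this (ColName and id_c come from the question):

// mongosh: index the filtered field so the count no longer scans 200GB of documents.
db.ColName.createIndex({ id_c: 1 });

// With the index in place, the count can be answered from the index alone:
db.ColName.find({ id_c: 1201 }).count();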
That amount of data can easily be managed by MySQL, MSSQL or Oracle with the given hardware specification. You don't need a NoSQL database for that; NoSQL databases are made for much larger storage needs, which actually require lots of hardware (RAM, hard disks) to be efficient.
You need to define an index to read that id, and use a normal SQL database.

When should I create a new collections in MongoDB?

So just a quick best-practice question here: how do I know when I should create new collections in MongoDB?
I have an app that queries TV show data. Should each show have its own collection, or should they all be stored within one collection, with the relevant data in the same document? Please explain why you chose the approach you did. (I'm still very new to MongoDB; I'm used to MySQL.)
The Two Most Popular Approaches to Schema Design in MongoDB
Embed data into documents and store them in a single collection.
Normalize data across multiple collections.
Embedding Data
There are several reasons why MongoDB doesn't support joins across collections, and I won't get into all of them here. But the main reason why we don't need joins is because we can embed relevant data into a single hierarchical JSON document. We can think of it as pre-joining the data before we store it. In the relational database world, this amounts to denormalizing our data. In MongoDB, this is about the most routine thing we can do.
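In the TV-show example from the question, an embedded document might look roughly like this (the field names and data are illustrative, not a prescribed schema):

// One document in a "shows" collection, with seasons and episodes embedded.
db.shows.insertOne({
  title: "Example Show",        // illustrative data only
  network: "Example Network",
  seasons: [
    {
      number: 1,
      episodes: [
        { number: 1, title: "Pilot", airDate: ISODate("2020-01-01") },
        { number: 2, title: "Second Episode", airDate: ISODate("2020-01-08") }
      ]
    }
  ]
});

// Everything a show page needs comes back in a single round trip:
db.shows.findOne({ title: "Example Show" });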
Normalizing Data
Even though MongoDB doesn't support joins, we can still store related data across multiple collections and still get to all of it, albeit in a roundabout way. This requires us to store a reference to a key from one collection inside another collection. It sounds similar to relational databases, but MongoDB doesn't enforce any key constraints for us like most relational databases do. Enforcing key constraints is left entirely up to us. We're good enough to manage it though, right?
Accessing all related data in this way means we're required to make at least one query for every collection the data is stored across. It's up to each of us to decide if we can live with that.
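A sketch of the normalized alternative for the same TV-show data, where episodes live in their own collection and carry a reference back to the show (again, the names are illustrative):

// "shows" holds one small document per show; "episodes" references it by showId.
const showId = db.shows.insertOne({ title: "Example Show" }).insertedId;

db.episodes.insertOne({
  showId: showId,   // manual reference; MongoDB does not enforce it
  season: 1,
  number: 1,
  title: "Pilot"
});

// Reading the related data now takes one query per collection:
const show = db.shows.findOne({ _id: showId });
const episodes = db.episodes.find({ showId: show._id }).toArray();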
When to Embed Data
Embed data when that embedded data will be accessed at the same time as the rest of the document. Pre-joining data that is frequently used together reduces the amount of code we have to write to query across multiple collections. It also reduces the number of round trips to the server.
Embed data when that embedded data only pertains to that single document. Like most rules, we need to give this some thought before blindly following it. If we're storing an address for a user, we don't need to create a separate collection to store addresses just because the user might have a roommate with the same address. Remember, we're not normalizing here, so duplicating data to some degree is ok.
Embed data when you need "transaction-like" writes. Prior to v4.0, MongoDB did not support transactions, though it does guarantee that a single document write is atomic. It'll write the document or it won't. Writes across multiple collections could not be made atomic, and update anomalies could occur in any number of scenarios we can imagine. This is no longer the case since v4.0; however, it is still more typical to denormalize data to avoid the need for transactions.
When to Normalize Data
Normalize data when data that applies to many documents changes frequently. So here we're talking about "one to many" relationships. If we have a large number of documents that have a city field with the value "New York" and all of a sudden the city of New York decides to change its name to "New-New York", well then we have to update a lot of documents. Got anomalies? In cases like this where we suspect other cities will follow suit and change their name, then we'd be better off creating a cities collection containing a single document for each city.
Normalize data when data grows frequently. When documents grow, they have to be moved on disk. If we're embedding data that frequently grows beyond its allotted space, that document will have to be moved often. Since these documents are bigger each time they're moved, the process only grows more complex and won't get any better over time. By normalizing those embedded parts that grow frequently, we eliminate the need for the entire document to be moved.
Normalize data when the document is expected to grow larger than 16MB. Documents have a 16MB limit in MongoDB. That's just the way things are. We should start breaking them up into multiple collections if we ever approach that limit.
The Most Important Consideration to Schema Design in MongoDB is...
How our applications access and use data. This requires us to think? Ugh! What data is used together? What data is used mostly as read-only? What data is written to frequently? Let your application's data access patterns drive your schema, not the other way around.
The scope you've described is definitely not too much for "one collection". In fact, being able to store everything in a single place is the whole point of a MongoDB collection.
For the most part, you don't want to be thinking about querying across combined tables as you would in SQL. Unlike SQL, MongoDB lets you avoid thinking in terms of "JOINs"; in fact, it doesn't support them in the way a relational database does (the aggregation framework's $lookup is the closest equivalent).
See this slideshare:
http://www.slideshare.net/mongodb/migrating-from-rdbms-to-mongodb?related=1
Specifically, look at slides 24 onward. Note how a MongoDB schema is meant to replace the multi-table schemas customary in SQL and RDBMSs.
In MongoDB a single document holds all information regarding a record. All records are stored in a single collection.
Also see this question:
MongoDB query multiple collections at once