Is it possible to run queries on 200GB of data in MongoDB with 16GB RAM?

I am trying to run a simple query to find the number of all records with a particular value, using:
db.ColName.find({id_c:1201}).count()
I have 200GB of data. When I run this query, mongodb takes up all the RAM and my system starts lagging. After an hour of futile waiting, I give up without getting any results.
What can be the issue here and how can I solve it?

I believe the right approach in the NoSQL world isn't to run a full scan like that, but to accumulate stats over time.
For example, you could keep a stats collection of small objects, each with a kind or id property that takes a value like "totalUserCount". Whenever you add a user, you also update this count.
This way you get instant results: it's just reading a property value from a small stats collection.
By the way, the slowness is most likely caused by querying on a non-indexed field. Create an index on id_c and you will probably get much faster results.
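To make this concrete, here is a minimal mongo-shell sketch of both suggestions. The ColName collection and id_c field come from the question; the stats collection name and counter _id are made-up examples.

// Index the field used in the count; the count can then be answered from the index
// (IXSCAN/COUNT_SCAN) instead of scanning 200GB of documents.
db.ColName.createIndex({ id_c: 1 })
db.ColName.find({ id_c: 1201 }).count()

// Pre-aggregated counter pattern: bump a counter document on every insert...
db.stats.updateOne(
    { _id: "count_id_c_1201" },
    { $inc: { count: 1 } },
    { upsert: true }
)

// ...so reading the figure later is a single _id lookup on a tiny collection.
db.stats.findOne({ _id: "count_id_c_1201" })

The index still has to fit comfortably in RAM alongside the working set, which is why the counter approach stays fast regardless of how big the main collection gets.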

That amount of data can easily be managed by MySQL, MSSQL, or Oracle on the given hardware. You don't need a NoSQL database for this; NoSQL databases are made for much larger storage needs, which actually require lots of hardware (RAM, hard disks) to be efficient.
Define an index on that id column and use a normal SQL database.

Related

What are the production best practices to store a large number of documents when using MongoDB?

I need to store application transaction logs and decided to use MongoDB. Every day roughly 200,000 documents are stored in a single-node MongoDB.
We have some reports and operations (if something happened, then do something) that depend on those logs, so we need to find documents matching different criteria. At that pace, will this hold up? Will queries become slow to execute?
Any suggestions to make MongoDB efficient to use here?
By the way, all of that data is in a single collection, and the MongoDB server version is 4.2.6.
Mongo collections can grow to many terabytes without much issue. To be able to query that data in a speedy manner, you will have to analyze your queries and create indexes for the fields used in those queries.
Indexes are not free, though: they take up both disk space and RAM, because for indexes to be useful they need to fit entirely in RAM.
In most cases, if indexes and collections grow beyond what your hardware can handle, you will have to archive/evict old data and trim down the collections.
If your queries need to include that evicted data in order to generate your reports, you will have to keep another collection with summarized values/data of the evicted records, which you combine with the current data when generating the reports.
Alternatively, sharding can help with big data, but there are some limitations on the queries you can run against sharded collections.
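As a rough illustration of the indexing and summarizing advice above (the collection and field names are hypothetical; $merge needs MongoDB 4.2+, which matches the server version in the question):

// Index the fields your report queries actually filter on.
db.logs.createIndex({ appName: 1, createdAt: -1 })

// Confirm the index is used: the plan should show IXSCAN, not COLLSCAN.
db.logs.find({ appName: "payments", createdAt: { $gte: ISODate("2021-01-01") } })
       .explain("executionStats")

// Roll old records up into a summary collection before evicting/trimming them.
db.logs.aggregate([
    { $match: { createdAt: { $lt: ISODate("2021-01-01") } } },
    { $group: {
        _id: { app: "$appName", day: { $dateToString: { format: "%Y-%m-%d", date: "$createdAt" } } },
        total: { $sum: 1 }
    } },
    { $merge: { into: "log_summaries" } }
])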

Updating data in Mongo sorted by a particular field

I posted this question on Software Engineering portal without conducting any tests. It was also brought to my notice that this needs to be posted on SO, not there. Thanks for the help in advance!
I need Mongo to return the documents sorted by a field value. The easiest way to achieve this would be running db.collectionName.find().sort({field:priority}). I tried this method on a dummy collection of 1000 documents and it runs in 22ms, while db.collectionName.find() on the same data runs in 3ms, which means that Mongo is taking time to sort and return the documents (which is understandable). Both tests were done in the same environment, by adding .explain("executionStats") to the query.
I will be working with a large amount of data and concurrent requests to access DB, so I need the querying to be faster. My question is, is there a way to always keep the data sorted by a field in the DB so that I don't have to sort it over and over for all requests? For instance, some sort of update command that could sort the entire DB once a week or so?
A non-unique index on that field in this collection will give the results you're after and avoid the inefficient in-memory sorting.
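For example (a sketch; the collection name mirrors the question, and I'm assuming the sort field is called priority):

// One-time index creation; MongoDB keeps the index ordered as documents are inserted/updated.
db.collectionName.createIndex({ priority: 1 })

// The sort is now satisfied by walking the index: explain() should no longer show an in-memory SORT stage.
db.collectionName.find().sort({ priority: 1 }).explain("executionStats")

// The same index also serves the descending order.
db.collectionName.find().sort({ priority: -1 })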

MongoDB with LOTS OF data?

I'm a beginner with non-SQL databases like MongoDB, and I can't find anybody talking about collections with lots of data, like 1,000,000 entries or more.
I saw a company page on the official site, but nothing about companies with large data sets.
I've heard about a combo with SQL: the large data is stored in SQL tables and only the "cache" lives in MongoDB. Is that the only solution for MongoDB and large data?
We're using MongoDB to power Where's it Up, and the API behind it. We're currently pushing in >3 million documents per day. MongoDB is the only storage engine in use. We were keeping a bunch around for a while, but we're now using TTL to delete old records.
Things are going super well, just make sure you have all the indexes you need. Querying a million+ records without an index is bad, regardless of your storage engine. Auto-failover has been super helpful.
Something to watch out for is updating records to include more information, it can be pretty expensive if the document grows past pre-allocated space. We ended up changing how we stored data to avoid updates, and create new documents instead.
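This isn't their actual setup, but a rough sketch of the TTL and "create new documents instead of updating" points, with made-up collection and field names:

// TTL index: a background task removes documents roughly expireAfterSeconds after their createdAt value.
db.checks.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 })

// Appending a fresh document per measurement avoids growing (and relocating) existing documents.
db.checks.insertOne({
    target: "example.com",
    createdAt: new Date(),
    httpStatus: 200,
    responseMs: 123
})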
MongoDB in its current incarnation is explicitly designed to make it easy to scale out.
As for the numbers: one of my test databases has 10M records and runs easily on my MacBook Air, which is 4 years old now.
So here is what you can do when your current cluster can not handle the data stored (either because the indices are too big for your RAM or because processing the queries takes too long): add another node to your MongoDB cluster. Your performance gain should be something between slightly below linear (if your cluster was otherwise in perfect condition) and several orders of magnitude (when indices didn't fit into RAM and/or IO was pushed to its limits before, and that situation changed after scaling out).
A word of warning: you should have somebody who knows about MongoDB administration if you want to put your deployment into production. Though MongoDB administration seems to be easy, it is by no means something to be done by a layman, especially not for production use.
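For reference, "adding a node" here usually means adding a shard through a mongos router; a hedged sketch with made-up host, database, and key names:

// Run against a mongos; the new shard is itself a replica set.
sh.addShard("rs2/mongo-node-4:27017")

// Allow collections in this database to be sharded, then pick a shard key that distributes load evenly.
sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { customerId: 1, createdAt: 1 })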

How to reduce number of documents to be sync from a mongo DB

In my current project, I am using two databases.
A MongoDB instance gathering data from different data providers (about 15M documents)
Another (relational) database instance holding only the data needed by the application, i.e. a subset of the data in the MongoDB instance (about 5M rows)
As part of the synchronisation process, I need to regularly check for new entries in the MongoDB depending on data in the relational DB.
Let's say, this is about songs and artists, a document in the MongoDB might look like this:
{_id:1,artists:["Simon","Garfunkel"],"name":"El Condor Pasa"}
Part of the sync process is to import/update all songs from those artists that already exist in the relational DB, which are currently about 1M artists.
So how do I retrieve all songs of 1M named artists from MongoDB for import?
My first thought (and try) was to iterate over all artists and query all songs for each artist (of course, there's an index on the "artists" field). But this takes several minutes for each batch of 1,000 artists, which would make this process a long runner.
My second thought was to write all existing artists to a separate MongoDB collection and have a super query which only retrieves songs of artists that are stored in there. But so far I have not been able to retrieve data based on two collections.
Is this a good use case for map/reduce? If yes, can someone please give me a hint on how to achieve this? (I am not completely new to NoSQL, but sort of a newbie when it comes to map/reduce.)
Or is this idea just crazy and I have to stick with a process that's running for several days?
Thanks in advance for any hints.
If you regularly need to check for changes, then add a timestamp to your data, and incorporate that timestamp into your query. For example, if you add a "created_ts" attribute, then you can look for records that were created since the last time your batch ran.
Here are a few ideas for making the mongo interaction more efficient:
Reduce network overhead by using an $in query on batches of artists (see the sketch after this list). Play around with the size of the array of artist IDs in order to determine what works best for your case.
Reduce network overhead by only selecting or reading the attributes that you need.
Make sure that your documents are indexed by artist.
On the Mongo server, make sure that as much of your data fits into memory as possible. Retrieving data from disk is going to be slow no matter what else you do. If it doesn't fit into memory, then you have a few options -- buy more memory; shrink your data set (ex. drop attributes that you don't actually need); shard; etc.
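Putting the timestamp, $in, and projection ideas together, a sketch might look like this. The songs collection name, created_ts field, and batch size are assumptions; the artists and name fields come from the sample document in the question.

// Timestamp recorded by the previous successful sync run.
var lastSyncTs = ISODate("2021-06-01T00:00:00Z");

// One batch of artists that already exist in the relational DB; tune the batch size.
var artistBatch = ["Simon", "Garfunkel"];  // a few hundred to a few thousand names per query

db.songs.find(
    {
        artists: { $in: artistBatch },     // served by the multikey index on "artists"
        created_ts: { $gte: lastSyncTs }   // only documents added since the last run
    },
    { _id: 1, name: 1, artists: 1 }        // read only the attributes the import needs
)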

Hadoop Map/Reduce - simple use example to do the following

I have a MySQL database where I store a BLOB (which contains a JSON object) and an ID (for this JSON object). The JSON object contains a lot of different information. Say, "city:Los Angeles" and "state:California".
There are about 500k of such records for now, but they are growing. And each JSON object is quite big.
My goal is to do searches (real-time) in MySQL database.
Say, I want to search for all JSON objects which have "state" to "California" and "city" to "San Francisco".
I want to utilize Hadoop for the task.
My idea is that there will be a "job" which takes chunks of, say, 100 records (rows) from MySQL, checks them against the given search criteria, and returns the IDs of those which qualify.
Pros/cons? I understand one might think I should just use plain SQL power for this, but the thing is that the JSON object structure is pretty "heavy": if I map it onto SQL schemas there will be at least 3-5 table joins, which (I tried, really) creates quite a headache, and building all the right indexes eats RAM faster than one can think. ;-) And even then, every SQL query has to be analyzed to make sure it uses the indexes; otherwise a full scan is literally a pain. With such a structure, the only way "up" is vertical scaling, and I am not sure that's the best option for me, as I see how the JSON objects (the data structure) will grow, and the number of them will grow too. :-)
Help? Can somebody point me to simple examples of how this can be done? Does it make sense at all? Am I missing something important?
Thank you.
A few pointers to consider:
Hadoop (HDFS specifically) distributes data around a cluster of machines. Using MapReduce to analyze/process this data requires that the data is stored on HDFS to make use of the parallel processing power Hadoop offers.
Hadoop/MapReduce is nowhere near real-time. Even on small amounts of data, the time Hadoop takes to set up a job can be 30+ seconds. That overhead can't be avoided.
Maybe something to look into would be using Lucene to index your JSON objects as documents. You could store the index in Solr and easily query on anything you want.
In fact, you are missing something: searching a single huge text field will take much more time than indexing the database and searching the proper SQL way. The database was built to be used with SQL and indexes; it does not have the capability to parse and index JSON, so whatever way you find to search the JSON (probably just hacky string matching) will be much slower. 500k rows is not that much for MySQL to handle; you don't really need Hadoop, just a good normalized schema, the right indices, and optimized queries.
Sounds like you are trying to recreate CouchDB. CouchDB is built with a map-reduce framework and is made to work specifically with JSON objects.
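To give a flavour of that, here is a minimal sketch of a CouchDB view (a JavaScript map function). The design document and view names are made up; the city/state fields come from the question.

// Map function stored in a design document, e.g. _design/places, view "by_state_city".
// CouchDB runs it over every document and builds a persistent index of the emitted keys.
function (doc) {
    if (doc.state && doc.city) {
        emit([doc.state, doc.city], doc._id);
    }
}

// Querying the view is then a key lookup rather than a scan, e.g.:
// GET /mydb/_design/places/_view/by_state_city?key=["California","San Francisco"]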