Distributing MongoDB background index creation for single machine MongoDB setup - mongodb

I have a single-machine MongoDB setup which satisfies the needs of my application at runtime but imposes a significant bottleneck at data ingestion time, as the background indexing on an array field (an inverted index) takes days to complete. It seems to be the same issue as posted here: MongoDB large index build very slow. I wonder if it makes sense to delegate/distribute index creation and then deploy the resulting index on the main machine. If anyone has considered this, I would appreciate you sharing your experience. Here are some ideas I wanted to test:
Use a distributed job like Hadoop or Dataflow to create the index tuples, then load them back into either MongoDB directly or another DB that may be more efficient for storing an inverted index.
Use another service like Elasticsearch that could potentially handle indexing more efficiently; however, I have no experience with it and want to keep hosting everything on the same machine.

In the end I decided to generate all the tuples to index with Apache Beam/Dataflow, import them with mongoimport, and then create an index on the fields I need. This way I get a queryable index in hours rather than days.
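A minimal sketch of the import-then-index step (the file, database, collection, and field names are hypothetical; the Beam/Dataflow job simply writes the tuples as JSON lines):

mongoimport --db mydb --collection index_tuples --file tuples.json --numInsertionWorkers 4

// then, in the mongo shell, build the index on the already-loaded data
db.index_tuples.createIndex({ term: 1, docId: 1 })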

Related

How to choose the best mechanism for deleting logs saved to MongoDB

I'm implementing a logger using MongoDB and I'm quite new to the concept.
The logger is supposed to log each request and its response.
I'm facing the question of whether to use MongoDB's TTL index or just a query run overnight.
I think the first method might bring some overhead from the background thread, and probably from rebuilding the index after each deletion, but it frees space as soon as the documents expire, and this might be beneficial.
The second approach, on the other hand, does not have this kind of overhead, but it only frees up space at the end of each day.
It seems to me that the second approach suits my case better: my server is never on the edge of running out of disk space, but keeping the overhead on the server low always matters.
I'm wondering whether there are aspects of the subject that I'm missing, and I'm also not sure about the typical applications of MongoDB TTL indexes.
Just my opinion:
It seems best to store logs in monthly, daily, or hourly collections, depending on your application's write load, and at the end of the day to just drop() the oldest collections with a custom script. In my experience, TTL indexes do not work well when there is a heavy write load on your collection, since they add additional write load based on the expiration time.
For example, imagine you insert log events at 06:00h at 100k/sec and your TTL index lifetime is set to 3h; this means that 3 hours later, at 09:00h, those 100k/sec of deletes are applied to your collection and are also stored in the oplog... The solution in such cases is to add more shards, but that becomes kind of expensive... It is far easier to just drop the expired collection, as sketched below.
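To make the comparison concrete, a hedged sketch in the mongo shell (collection names and the 3h lifetime are placeholders):

// TTL variant: documents are deleted one by one and each delete hits the oplog
db.logs.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3 * 60 * 60 })

// bucketed variant: dropping an expired hourly collection is a single cheap operation
db.getCollection("logs_2023010106").drop()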
Moreover, depending on your project size, for bigger collections you can additionally shard and pre-split the collections on a compound index of a hashed datetime field (every log contains a timestamp) plus another field you will search often; this gives you scalable search across multiple distributed shards, along the lines of the sketch below.
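A rough sketch of that idea in the mongo shell (database, collection, and field names are hypothetical; note that compound shard keys containing a hashed field require a reasonably recent MongoDB version):

sh.enableSharding("logdb")
// hash the timestamp to spread writes, keep a frequently searched field in the key
sh.shardCollection("logdb.logs", { ts: "hashed", userId: 1 })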
Also note that MongoDB is a general-purpose document database and its full-text search is somewhat limited to expensive regex expressions, so if you need fast raw full-text search in your logs, an inverted-index search engine like Elasticsearch on top of your MongoDB backend may be a good solution to cover this functionality.

Updating data in Mongo sorted by a particular field

I posted this question on the Software Engineering portal without conducting any tests. It was also brought to my notice that it needs to be posted on SO, not there. Thanks for the help in advance!
I need Mongo to return the documents sorted by a field value. The easiest way to achieve this would be running the command db.collectionName.find().sort({field:priority}), however, I tried this method on a dummy collection of 1000 documents; it runs in 22ms. I also tried running db.collectionName.find() on the same data, it runs in 3ms, which means that Mongo is taking time to sort and return the documents (which is understandable). Both tests were done in the same environment and were done by adding .explain("executionStats") to the query.
I will be working with a large amount of data and concurrent requests to access DB, so I need the querying to be faster. My question is, is there a way to always keep the data sorted by a field in the DB so that I don't have to sort it over and over for all requests? For instance, some sort of update command that could sort the entire DB once a week or so?
A non-unique index on that field in this collection will give the results you're after and avoid the inefficient in-memory sorting.
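A minimal sketch in the mongo shell, assuming a field named priority standing in for whatever field the documents should be sorted by:

// build the index once; MongoDB keeps it up to date on every write
db.collectionName.createIndex({ priority: 1 })
// the sort can now walk the index instead of sorting in memory
db.collectionName.find().sort({ priority: 1 })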

Best way to query entire MongoDB collection for ETL

We want to query an entire live production MongoDB collection (v2.6, around 500GB of data on around 70M documents).
We're wondering what's the best approach for this:
A single query with no filtering to open a cursor and get documents in batches of 5/6k
Iterate with pagination, using a logic of find().limit(5000).skip(currentIteration * 5000)
We're unsure which approach is best practice and will yield the best results with minimal impact on performance.
I would go with 1. & 2. mixed if possible: iterate over your huge dataset in pages, but access those pages by querying instead of skipping over them, as skipping may be costly, as also pointed out by the docs:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
So, if possible, build your pages on an indexed field and process those batches of data with a corresponding query range, for example:
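A hedged sketch in the mongo shell that pages on the indexed _id field (the batch size is arbitrary):

var lastId = null;
while (true) {
    // range query instead of skip(): each page starts where the previous one ended
    var query = lastId === null ? {} : { _id: { $gt: lastId } };
    var page = db.yourcoll.find(query).sort({ _id: 1 }).limit(5000).toArray();
    if (page.length === 0) break;
    page.forEach(function (doc) {
        // whatever per-document processing you need
        lastId = doc._id;
    });
}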
The brutal way
Generally speaking, most drivers load batches of documents anyway. So your language's equivalent of
var docs = db.yourcoll.find()
docs.forEach(
    function(doc){
        // whatever per-document processing you need
    }
)
will actually just create a cursor initially and then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection has little to no advantage, but carries the overhead of multiple queries.
As for ETL, manually iterating over the documents to modify them and then storing them in a new instance does not, under most circumstances, seem reasonable to me, as you are basically reinventing the wheel.
Alternate approach
Generally speaking, there is no one-size-fits-all "best" way. The best way is the one that best fits your functional and non-functional requirements.
When doing ETL from MongoDB to MongoDB, I usually proceed as follows:
ET…
Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given that in general each document is processed one by one, the impact on your production environment should be minimal, if noticeable at all. After you have done your transformation, simply use the $out stage to save the results in a new collection.
Even collection-spanning transformations can be achieved using the $lookup stage, as in the sketch below.
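A minimal sketch in the mongo shell; the collection names, fields, and the transformation itself are hypothetical:

db.orders.aggregate([
    // transform: reshape each document
    { $project: { customerId: 1, total: { $sum: "$items.price" } } },
    // optional collection-spanning step
    { $lookup: { from: "customers", localField: "customerId", foreignField: "_id", as: "customer" } },
    // load: write the result into a new collection
    { $out: "orders_transformed" }
])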
…L
After you have done the extract and transform on the old instance, you have several possibilities for loading the data into the new MongoDB instance:
Create a temporary replica set consisting of the old instance, the new instance, and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so that your new instance becomes primary, then remove the old instance and the arbiter from the replica set (a rough sketch follows after these options). The advantage is that you use MongoDB's replication mechanics to get the data from your old instance to your new instance, without having to worry about partially executed transfers and the like. You can also use it the other way around: transfer the data first, make the new instance the primary, remove the other members from the replica set, then perform your transformations and remove the "old" data.
Use db.cloneCollection(). The advantage here is that you only transfer the collections you need, at the expense of more manual work.
Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to no advantage over the replica set method.
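For the replica set variant, a hedged sketch in the mongo shell (host names and the replica set name are placeholders; each instance must be started with --replSet):

// on the old instance
rs.initiate()
rs.add("new-host:27017")          // new instance joins and syncs the data
rs.addArb("arbiter-host:27017")
// once the new instance has caught up
rs.stepDown()                     // old primary steps down so the new instance can take over
// then, on the new primary
rs.remove("old-host:27017")
rs.remove("arbiter-host:27017")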
As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.
MongoDB 3.4 supports parallel collection scans via the parallelCollectionScan command. I have never tried this myself yet, but it looks interesting to me.
It will not work on sharded clusters. If you have a parallel processing setup, it will definitely speed up the scanning.
Please see the documentation here: https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/
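A hedged sketch of the command (the collection name and cursor count are placeholders); each returned cursor can then be drained by a separate worker:

db.runCommand({ parallelCollectionScan: "yourcoll", numCursors: 4 })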

MongoDB integration with Solr

I am a beginner with MongoDB and its integration with Solr. From different posts I got an idea of the integration steps, but I need info on the points below.
I have the data in MongoDB; for faster retrieval we are integrating it with Solr.
Solr indexes all MongoDB entries. Is this indexing a one-time activity after integration, or do we need to periodically update Solr to index the entries that were inserted after the integration?
If we need to periodically update Solr, it becomes an extra overhead to maintain the data in Solr as well as in MongoDB. What are the best approaches to overcoming this?
As far as I know there is no official (supported/complete) solution to integrate MongoDB and Solr, but let me give you some ideas/directions.
For me the best approach, when it is possible to modify the application, is to have the persistence layer perform every write operation in MongoDB and in Solr at (roughly) the same time. That way you control exactly what you send to the database and what you index for full-text search. But as I said, this means you have to change your application code. (You will have to change it anyway to be able to query Solr when needed.) And yes, you have to index all the existing documents the first time.
You can use a "connector" approach where MongoDB and Solr are connected together; this can be done in various ways.
You can use, for example, the MongoDB Connector available here: https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr, also has a connector for MongoDB, documented here: http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it, so I cannot comment, but it is another approach.)
Your point #2 is true: you have to manage two clusters, make sure the data stay in sync, and sometimes pay the price of inconsistency between the Solr index and a document just updated in MongoDB... So you need to see whether the best approach for your application is to use MongoDB alone or MongoDB with Solr (see the comment below).
Just a small comment in addition to this answer:
You are talking about "faster retrieval"; I'm not sure that should be the reason. If you write correct queries with correct indexes in MongoDB, you should be able to achieve it without Solr. If your requirement is really oriented towards the power of Solr, meaning a full-text index (with all its related features), then it makes sense.
How large is your data? MongoDB has a few good indexing mechanisms of its own.
There is a powerful geo API, and for full-text search there is http://docs.mongodb.org/manual/core/index-text/ (see the sketch below). So it would be ideal to identify whether your need fits into MongoDB or you need to spill over to Solr.
About the indexing part: how often is your data updated? If you can afford infrequent updates, then a batch job that re-indexes once a day may work for you. Ideally Solr would work well for some form of master data.
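For completeness, a minimal sketch of MongoDB's built-in text search in the mongo shell (collection and field names are hypothetical):

// one text index per collection, on the field(s) you want searchable
db.articles.createIndex({ body: "text" })
db.articles.find({ $text: { $search: "mongodb solr integration" } })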

MongoDB with LOTS OF data?

I'm a beginner with non-SQL structures like MongoDB, and I can't find anyone talking about collections with lots of data, like 1,000,000 entries or more.
I saw a companies page on the official site, but nothing about companies with large data.
I heard about a combo with SQL: large data are stored in SQL tables and only the "cache" is kept in MongoDB. Is that really the only option for MongoDB and large data?
We're using MongoDB to power Where's it Up and the API behind it. We're currently pushing in >3 million documents per day. MongoDB is the only storage engine in use. We were keeping a bunch of them around for a while, but we're now using TTL to delete old records.
Things are going super well, just make sure you have all the indexes you need. Querying a million+ records without an index is bad, regardless of your storage engine. Auto-failover has been super helpful.
Something to watch out for is updating records to include more information; it can be pretty expensive if the document grows past its pre-allocated space. We ended up changing how we stored data to avoid updates and to create new documents instead.
MongoDB in its current incarnation is explicitly designed to make it easy to scale out.
As for the numbers: one of my test databases has 10M records and runs easily on my MacBook Air, which is 4 years old now.
So here is what you can do when your current cluster cannot handle the data stored (either because the indices are too big for your RAM or because processing the queries takes too long): add another node to your MongoDB cluster (see the sketch below). Your performance gain should be somewhere between slightly below linear (if your cluster was otherwise in perfect condition) and several orders of magnitude (when the indices didn't fit into RAM and/or IO was pushed to its limits before, and that situation changed after scaling out).
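A hedged sketch of scaling out with sharding in the mongo shell, run against a mongos (replica set, host, database, and field names are placeholders):

sh.addShard("rs2/host-c:27017,host-d:27017")   // add another shard to the cluster
sh.enableSharding("mydb")
sh.shardCollection("mydb.mycoll", { userId: "hashed" })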
A word of warning: you should have somebody who knows about MongoDB administration if you want to put your deployment into production. Though MongoDB administration seems easy, it is by no means something to be done by a layman, especially not for production use.