mongolastic is taking a long time to index a collection - mongodb

I am using mongolastic to index a collection in Elasticsearch.
It took around 6 hours to index a collection containing 30,000 documents. Is there a way to increase the efficiency?
Also, I noticed that the indexing was done in batches (of 200); can we increase this limit too?
Any suggestions?

As for the limit of the batch size - this is taken from the link you provided yourself:
Override the default batch size which is normally 200. (Optional)
batch: <number> (6)

Related

Default batch size in aggregate command in MongoDB

From the aggregate command documentation:
To indicate a cursor with the default batch size, specify cursor: {}.
However, I haven't found the actual default value or how to look it up (maybe using a mongo admin command).
How can I find this value?
From the docs:
The MongoDB server returns the query results in batches. The amount of data in the batch will not exceed the maximum BSON document size.
New in version 3.4: Operations of type find(), aggregate(), listIndexes, and listCollections return a maximum of 16 megabytes per batch. batchSize() can enforce a smaller limit, but not a larger one.
find() and aggregate() operations have an initial batch size of 101 documents by default. Subsequent getMore operations issued against the resulting cursor have no default batch size, so they are limited only by the 16 megabyte message size.
So, the default for the first batch is 101 documents, the batch size for subsequent getMore() calls is undetermined but cannot exceed 16 megabytes.
If I'm not entirely wrong, I think it's 101 for the aggregation pipeline.
See here
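As a quick illustration of the cursor option in the mongo shell (the collection name "orders" and the $match stage are placeholders), the batch size can be set explicitly on an aggregation, while cursor: {} requests the default first batch of 101 documents:
// Helper form: ask for a first batch of up to 500 documents.
var cursor = db.orders.aggregate(
    [ { $match: { status: "A" } } ],
    { cursor: { batchSize: 500 } }
);
// Raw command form; cursor: {} here would request the default batch size instead.
db.runCommand({
    aggregate: "orders",
    pipeline: [ { $match: { status: "A" } } ],
    cursor: { batchSize: 500 }
});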

How to increase pagesize temporarily?

Just for testing purposes I would like to get 100, 500, 1000, 5000, 10000, 20000 ... records from a collection. At the moment the largest pagesize is 1000. How can I increase it to whatever I would like, just for testing?
RESTHeart has a pagesize limit of 1000 documents per request, and that limit is hardcoded in the class org.restheart.handlers.injectors.RequestContextInjectorHandler.
If you, for any reason, want to increase that limit then you have to change the source code and build your own jar.
However, RESTHeart speeds up the execution of GET requests to collection resources via its db cursor pre-allocation engine. This applies when several documents need to be read from a big collection, and it moderates the effect of the MongoDB cursor.skip() method, which slows down linearly with the offset. So it already optimizes the navigation of large MongoDB collections, if this is what you are looking for.
Please have a look at the Speedup Requests with Cursor Pools and Performances pages in the official documentation for more information.

Maximum size of bulk create (or update) for Cloudant?

When using bulk operations with Cloudant, is there a "hard" limit (total size of all documents / number of documents)?
Also: is there a best-practice setting (total size / number of documents per request)?
I understand there is a 65Mb limit in the size of individual documents in Cloudant. Having said that, I would try to avoid getting anywhere near that size of document.
A rule of thumb would be if the size of your documents is over a few tens of kilobytes, you might be better creating more documents and retrieving them using a view.
In terms of bulk operations, I tend to use batches of 500 documents. Bulk operations are a much more efficient way of transferring data between your client software and Cloudant and a 500 document batch size (as long as your document size is reasonable) is a good rule of thumb.
There is no fixed number of documents you can update in one bulk request, but there is a size limit of 1 MB for the whole bulk request body; if the request data exceeds 1 MB, the request will be rejected.
In my own test with JSON objects of 12 fields each, it took around 2,000 documents to reach the 1 MB limit, but this will vary with how small or large your documents are.
Click here for more information under Rule 14: Use the bulk API
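As a rough sketch of the 500-document batching mentioned above, here is a minimal Node.js example (Node 18+ for the built-in fetch) that posts batches to the standard _bulk_docs endpoint; the account URL, database name and credentials are placeholders:
// Upload documents to Cloudant in batches of 500 via POST /{db}/_bulk_docs.
const CLOUDANT_URL = "https://ACCOUNT.cloudant.com";   // placeholder account URL
const DB = "mydb";                                      // placeholder database name
const AUTH = "Basic " + Buffer.from("user:password").toString("base64"); // placeholder credentials

async function bulkInsert(docs, batchSize = 500) {
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    const res = await fetch(`${CLOUDANT_URL}/${DB}/_bulk_docs`, {
      method: "POST",
      headers: { "Content-Type": "application/json", "Authorization": AUTH },
      body: JSON.stringify({ docs: batch }),
    });
    if (!res.ok) throw new Error("Bulk request failed: " + res.status);
  }
}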

Unable to fetch more than 10k records

I am developing an app where I have more than 10k records added to a class in Parse. Now I am trying to fetch those records using PFQuery (I am using the "skip" property), but I am unable to fetch records beyond 10k and I get the following error message:
"Skips larger than 10000 are not allowed"
This is a big problem for me since I need all the data.
Has anybody come across this problem? Please share your views.
Thanks
The problem is indeed due to the cost of mongo skip operations. You can formulate a query such that you don't need the skip operator. My preferred method is to order by objectId and then add a condition that objectId is greater than the last yielded objectId. This type of query can be indexed and remain fast, unlike skip pagination, which has an O(N^2) cost in seeks.
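At the MongoDB level the same idea looks roughly like this (a sketch only; "mycollection" and the page size of 1000 are placeholders):
// Keyset ("range") pagination: sort by _id and resume after the last _id seen,
// instead of using skip(). Relies on the default ascending index on _id.
var lastId = null;
var page;
do {
    var query = (lastId === null) ? {} : { _id: { $gt: lastId } };
    page = db.mycollection.find(query).sort({ _id: 1 }).limit(1000).toArray();
    if (page.length > 0) {
        lastId = page[page.length - 1]._id;
        // ... process this page of documents here ...
    }
} while (page.length > 0);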
My assumption would be that it's based on performance issues with MongoDB's skip implementation.
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get to the offset or skip position before it begins to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.

Are there any tools to estimate index size in MongoDB?

I'm looking for a tool to get a decent estimate of how large a MongoDB index will be based on a few signals like:
How many documents in my collection
The size of the indexed field(s)
The size of the _id I'm using if not ObjectId
Geo/Non-geo
Has anyone stumbled across something like this? I can imagine it would be extremely useful given Mongo's performance degradation once it hits the memory wall and documents start getting paged out to disk. If I have a functioning database and want to add another index, the only way I'll know if it will be too big is to actually add it.
It wouldn't need to be accurate down to the bit, but with some assumptions about B-Trees and the index implementation I'm sure it could be reasonable enough to be helpful.
If this doesn't exist already I'd like to build and open source it, so if I've missed any required parameters for this calculation please include in your answer.
I just spoke with some of the 10gen engineers; there isn't such a tool, but you can do a back-of-the-envelope calculation based on this formula:
2 * [ n * ( 18 bytes overhead + avg size of indexed field + 5 or so bytes of conversion fudge factor ) ]
Where n is the number of documents you have.
The overhead and conversion padding are mongo specific but the 2x comes from the b-tree data structure being roughly half full (but having allocated 100% of the space a full tree would require) in the worst case.
I'd explain more but I'm learning about it myself at the moment. This presentation will have more details: http://www.10gen.com/presentations/mongosp-2011/mongodb-internals
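As a quick worked example (the numbers are made up): for 1,000,000 documents with an average indexed field size of 12 bytes, the formula gives roughly 67 MB:
// Back-of-the-envelope estimate using the formula above.
var n = 1000000;             // number of documents (assumed)
var avgFieldBytes = 12;      // average size of the indexed field (assumed)
var estimateBytes = 2 * (n * (18 + avgFieldBytes + 5));    // 70,000,000 bytes
print((estimateBytes / (1024 * 1024)).toFixed(1) + " MB"); // ~66.8 MB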
You can check the sizes of the indexes on a collection by using the command:
db.collection.stats()
More details here: http://docs.mongodb.org/manual/reference/method/db.collection.stats/#db.collection.stats
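For example, in the mongo shell (the collection name is a placeholder; the argument scales the reported sizes to megabytes):
var s = db.mycollection.stats(1024 * 1024);
print("totalIndexSize: " + s.totalIndexSize + " MB");
printjson(s.indexSizes);   // size of each individual index, in MB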
Another way to calculate this is to ingest ~1,000 or so documents into every collection; in other words, build a small-scale model of what you're going to end up with in production, create the indexes (or whatever you need), and calculate the final numbers from the db.collection.stats() averages.
Edit (from a comment): Tyler's answer describes the original MMAP storage engine circa MongoDB 2.0, but this formula definitely isn't applicable to modern versions of MongoDB. WiredTiger, the default storage engine in MongoDB 3.2+, uses index prefix compression, so index sizes will vary based on the distribution of key values. There are also a variety of index types and options which might affect sizing. The best approach for a reasonable estimate would be empirical estimation with representative test data for your projected growth.
The best option is to test in a non-prod deployment!
Insert 1,000 documents and check the index sizes, insert 100,000 documents and check the index sizes, and so on.
An easy way to loop over all collections and report their total index sizes:
// Mongo shell: sum totalIndexSize (scaled to MB) across all collections in all databases.
var y = 0;
db.adminCommand("listDatabases").databases.forEach(function (d) {
    var mdb = db.getSiblingDB(d.name);
    mdb.getCollectionNames().forEach(function (c) {
        var s = mdb[c].stats(1024 * 1024).totalIndexSize;   // scale stats to megabytes
        y = y + s;
        print("db.Collection: " + d.name + "." + c + " totalIndexSize: " + s + " MB");
    });
});
print("============================");
print("Instance totalIndexSize: " + y + " MB");