Lucene.net limitations on index files with large data

We are planning to use Lucene.net for search in order to have a Google-like search. We will have a huge amount of data, loaded from the database, which needs to be indexed. Are there any limitations on the size of the indexed documents, and what is the maximum size a single index document can be? How should the total size be distributed given the average size of each document?

There are no limitations at the data sizes you're hinting at. I'm running a Lucene.net app that works with hundreds of GB of indexed data without splitting the index (beyond how Lucene naturally splits it into segments without you asking).
Just add your data to your index and forget about any limitations; you're way below any potential issues. (But do read all the performance guidelines, the Lucene ones, since Lucene.net is a direct port, all those tips apply.)

Related

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (like maybe you maxed out the 64 allowable indexes) and you are searching for values within only certain fields.
To go to the extreme, let's say each object has an authorName field, a bookTitles field, and a bookFullText field (where the content of all their novels is collected).
If there were no index and you looked for a list of authorNames, would it have to read through all the content of all the fields in the entire collection, or would it read just the authorName fields and the names, but not the content, of the other fields?
Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, that the server will:
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (jump to next document in collection).
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in case of MongoDB, table rows in a RDBMS) from physical pages. For performance reasons the database generally will not read documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect it is set so that the server can read documents into memory completely without worrying that they might be very large. Postgres, for example, distinguishes VARCHAR from TEXT types by where the data is stored: VARCHAR data is stored inline in the table row and TEXT data is stored separately, presumably to avoid this exact issue of having to read it from disk if any column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.
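One way to see this for yourself is to ask the query planner; a minimal sketch, assuming a hypothetical books collection with the fields described in the question:
// With no index on authorName, the winning plan is a COLLSCAN and
// executionStats.totalDocsExamined equals the number of documents in the
// collection, i.e. every document is fetched in full regardless of which
// fields you project.
db.books.find({ authorName: "Jane Austen" }).explain("executionStats")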
In the common case (WiredTiger with snappy compression), BSON documents are kept in 32 KB blocks within 64 MB (default size) chunks on storage. If your document's compressed size is 48 KB, two 32 KB blocks must be loaded into memory, uncompressed, and searched for your non-indexed field, which is an expensive operation. Moreover, if you search multiple documents, they are usually not written in sequential blocks, which increases the IOPS demand on your backend storage. This is why it is best to do some initial analysis and create indexes on the fields you will search most often. Indexes (B-tree) are very effective since they are kept in memory most of the time, compressed (prefix compression), and are very fast for field searches.
There are text indexes in MongoDB that are enough for some simple text searches, or you can use regular expressions.
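For example, a minimal sketch using the hypothetical collection and field names from the question:
// A collection can have at most one text index.
db.books.createIndex({ bookFullText: "text" })
db.books.find({ $text: { $search: "whale harpoon" } })   // simple full-text search
db.books.find({ authorName: /austen/i })                 // regex alternative; a case-insensitive, unanchored regex cannot use an index efficiently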
If you will be doing full-text search most of the time, you are better off putting a search engine like Elasticsearch, which supports inverted indexes, in front of the database, since the inverted index has your full-text results already calculated and can return them many times faster than a similar operation using standard B-tree indexes.
If you use Atlas (the MongoDB cloud service), there is already a Lucene engine (inverted index) integrated that can do the full-text search for you.
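If you go that route, the full-text query is expressed as an aggregation stage; a rough sketch (the index name "default" and the field name are assumptions, and an Atlas Search index must be defined on the collection first):
db.books.aggregate([
  { $search: { index: "default", text: { query: "whale harpoon", path: "bookFullText" } } },
  { $limit: 10 }
])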
I hope my answer throws some light on the subject ... :)

Database for filtering XML documents

I would like to hear some suggestions on implementing a database solution for the problem below:
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values through which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) The documents are requested based on a date range, documents matching a list of IDs, the top 10 new documents, and records that are new since the previous request.
Here is what I have done so far:
1) Checked if I can use Redis; it is limited to a few data types and also cannot use multiple where-conditions to filter a Hash in Redis. Indexing is based on date and lots of other fields, and I am unable to choose the right data structure to store it in a hash.
2) Investigated DynamoDB; it is again a key-value store where all the filter conditions would have to be stored as one value. I am not sure it would be efficient to query a JSON document to filter the right XML document.
3) Investigated Cassandra, and it looks like it may fit my requirements, but it has a limitation in that read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server, and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra doesn't have the search capabilities which you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database suited for search operations.
I suggest you look at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)
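To give a feel for how the requirements in the question would map onto such an engine, here is a rough Elasticsearch query sketch (the index name xml_docs, the field names, and the use of Node's global fetch are all assumptions on my part) covering "date range + list of IDs, newest first, top 10":
// Hypothetical index "xml_docs" with a keyword field "docId" and a date field "receivedAt".
const body = {
  query: {
    bool: {
      filter: [
        { range: { receivedAt: { gte: "now-3d/d" } } },  // only the 3-day retention window
        { terms: { docId: ["id1", "id2", "id3"] } }      // documents matching a list of IDs
      ]
    }
  },
  sort: [{ receivedAt: "desc" }],  // newest first
  size: 10                         // top 10
};
fetch("http://localhost:9200/xml_docs/_search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(body)
})
  .then(res => res.json())
  .then(result => console.log(JSON.stringify(result.hits, null, 2)));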

Why and when is necessary to rebuild indexes in MongoDB?

I've been working with MongoDB for a while, and today I had a doubt while discussing with a colleague.
The thing is that when you create an index in MongoDB, the collection is processed and the index is built.
The index is updated on insertion and deletion of documents, so I don't really see the need to run a rebuild index operation (which drops the index and then rebuilds it).
According to MongoDB documentation:
Normally, MongoDB compacts indexes during routine updates. For most
users, the reIndex command is unnecessary. However, it may be worth
running if the collection size has changed significantly or if the
indexes are consuming a disproportionate amount of disk space.
Has anyone had a need to run a rebuild index operation that was worth it?
As per the MongoDB documentation, there is generally no need to routinely rebuild indexes.
NOTE: Any advice on storage becomes more interesting with MongoDB 3.0+, which introduced a pluggable storage engine API. My comments below are specifically in reference to the default MMAP storage engine in MongoDB 3.0 and earlier. WiredTiger and other storage engines have different storage implementations for data & indexes.
There may be some benefit in rebuilding an index with the MMAP storage engine if:
An index is consuming a larger than expected amount of space compared to the data. Note: you need to monitor historical data & index size to have a baseline for comparison.
You want to migrate from an older index format to a newer one. If a reindex is advisable, this will be mentioned in the upgrade notes. For example, MongoDB 2.0 introduced significant index performance improvements, so the release notes include a suggested reindex to the v2.0 format after upgrading. Similarly, MongoDB 2.6 introduced 2dsphere (v2.0) indexes which have a different default behaviour (sparse by default). Existing indexes are not rebuilt after index version upgrades; the choice of if/when to upgrade is left to the database administrator.
You have changed the _id format for a collection from a monotonically increasing key (e.g. ObjectID) to a random value, or vice versa. This is a bit esoteric, but there's an index optimisation that splits b-tree buckets 90/10 (instead of 50/50) if you are inserting _ids that are always increasing (ref: SERVER-983). If the nature of your _ids changes significantly, it may be possible to build a more efficient b-tree with a re-index.
For more information on general B-tree behaviour, see: Wikipedia: B-tree
Visualising index usage
If you're really curious to dig into the index internals a bit more, there are some experimental commands/tools you can try. I expect these are limited to MongoDB 2.4 & 2.6 only:
indexStats command
storage-viz tool
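If you do decide a rebuild is warranted, the shell side of it is small; a minimal sketch (the collection name is hypothetical, and note that reIndex blocks other operations on the collection while it runs):
db.mycollection.stats().dataSize      // data size, for a baseline comparison
db.mycollection.stats().indexSizes    // per-index size in bytes
db.mycollection.reIndex()             // drops and rebuilds all indexes on the collection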
While I don't know the exact technical reasons why in MongoDB's case, I can make some assumptions about this, based on what I know about indexing from other systems and on the documentation that you quoted.
The General Idea Of An Index
When moving from one document to the next through the full document collection, there is a lot of wasted time and effort skipping past all the data that doesn't need to be dealt with. If you're looking for the document with id "1234", having to move through 100K+ of data for each document makes it slow.
Rather than having to search through all of the content of each document in the collection (physically moving the disk read heads, etc), an index makes this fast. It's basically a key/value pair that gives you the id and the location of that document. MongoDB can quickly scan through all of the id's in the index, find the locations of the documents that it needs, and go load them directly.
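You can see this difference in the query planner; a small sketch with made-up collection and field names:
db.items.createIndex({ itemId: 1 })
// With the index, the plan uses an IXSCAN: MongoDB walks the index keys,
// collects the matching record locations, and only then fetches those documents.
db.items.find({ itemId: 1234 }).explain("executionStats")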
Allocating File Size For An Index
Indexes take up disk space because they are basically a key/value pair stored in a much smaller location. If you have a very large collection (large number of items in the collection) then your index grows in size.
Most operating systems allocate chunks of disk space in certain block sizes. Most databases also allocate disk space in large chunks, as needed.
Instead of growing 100K of file size when 100K of documents are added, MongoDB will probably grow 1MB or maybe 10MB or something - I don't know what the actual growth size is. In SQL Server, you can tell it how fast to grow, and MongoDB probably has something like that.
Growing in chunks gives the ability to 'grow' the documents into the space faster because the database doesn't need to constantly expand. If the database now has 10MB of space already allocated, it can just use that space up. It doesn't have to keep expanding the file for each document. It just has to write the data to the file.
This is probably true of collections and indexes for collections - anything that is stored on disk.
File Size And Index Re-Building
When a large collection has a lot of documents added and removed, the index becomes fragmented. Index keys may not be in order because there was room in the middle of the index file and not at the end when the entry needed to be written. Index keys may have a lot of space in between them, as well.
If there are 10,000 items in the index, and # 10,001 needs to be inserted, it may be inserted in the middle of the index file. Now the index needs to re-build itself to put everything back in order. This involves moving a lot of data around, to make room at the end of the file and put item # 10,001 at the end.
If the index is constantly being thrashed - lots of stuff removed and added - it's probably faster to just grow the index file size and always put stuff at the end. This is fast for building the index, but leaves empty holes in the file where old things were deleted.
If the index file has empty space where deleted things used to be, this is wasted effort when reading the index. The index file has more movement than needed, to get to the next item in the index. So, the index repairs itself... which can be time consuming for very large collections or very large changes to a collection.
Rebuild For A Large Index File
It can take a lot of disk access and I/O operations to correctly compact the index file back down to a reasonable size, with everything in order. Move out of place items to temp location, free up space in right spot, move them back. Oh by the way, to free up space, you had to move other items to temp location. It's recursive and heavy-handed.
Therefore, if you have a very large number of items in a collection and that collection has items added and removed on a regular basis, the index may need to be rebuilt from scratch. Doing this would wipe the current index file and rebuild from the ground up - which is probably going to be faster than trying to do thousands of moves inside of the existing file. Rather than moving things around, it just writes them sequentially, from scratch.
Large Change In Collection Size
Given everything I'm assuming above, a large change in the collection size would cause this kind of thrashing. If you have 10,000 documents in the collection and you delete 8,000 of them... well, now you have empty space in your index file where the 8,000 items used to be. MongoDB needs to move the remaining 2,000 items around in the physical file, to rebuild it in a compact form.
Instead of waiting around for 8,000 empty spaces to be cleaned up, it might be faster to rebuild from the ground up with the remaining 2,000 items.
Conclusion? Maybe?
So, the documentation that you quoted is probably going to deal with "big data" needs or high thrashing collections and indexes.
Also keep in mind that I'm making an educated guess based on what I know about indexing, disk allocation, file fragmentation, etc.
My guess is that "most users" in the documentation means that 99.9% or more of MongoDB collections don't need to worry about this.
MongoDB specific case
According to MongoDB documentation:
The remove() method does not remove the indexes
So if you delete documents from a collection, you are wasting disk space unless you rebuild the index for that collection.

Optimizing for random reads

First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for, from a technical point of view, is the following:
I have a fairly large collection, with about 500 million documents that takes about 180 GB including indexes.
Example document:
{
_id: 123234,
type: "Car",
color: "Blue",
description: "bla bla"
}
Queries consist of finding documents with a specific field value. Like so:
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However, the access pattern for this data will be completely random. At a given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at most 100,000 documents at a time.
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache is the indexes. An estimation of the working set is hard, if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non-SSD disk). Queries return in decent time and obviously instantly if the result set is already in the cache. But as already stated this will most likely not be the typical case. It is not critical that the queries are lightning fast, more so that they are dependable and that the database will run in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
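Concretely, the indexes we have in mind are just single-field ones, something like the following sketch (the collection name is illustrative):
db.things.createIndex({ type: 1 })
db.things.createIndex({ color: 1 })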
Right now the index sizes are quite manageable. For the 500 million documents each of these indexes are about 2.5GB and fit easily in RAM.
Regarding the average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats(), so I guess it is for the compressed data on disk, and not how much it actually takes up once in RAM.
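For reference, these figures come from the shell; a sketch of the commands (collection name illustrative), including a rough way to compare index sizes against the configured WiredTiger cache:
db.stats().avgObjSize          // average object size in bytes
db.things.stats().indexSizes   // on-disk size of each index, in bytes
db.serverStatus().wiredTiger.cache["maximum bytes configured"]     // configured cache size
db.serverStatus().wiredTiger.cache["bytes currently in the cache"] // currently resident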
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly at random over type (which is what I'm taking
I have no idea what range of documents will be accessed
to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything to help you besides just better hardware. Indexes should remain in RAM because they'll constantly be in use.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of certain type or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.

Are there any tools to estimate index size in MongoDB?

I'm looking for a tool to get a decent estimate of how large a MongoDB index will be based on a few signals like:
How many documents in my collection
The size of the indexed field(s)
The size of the _id I'm using if not ObjectId
Geo/Non-geo
Has anyone stumbled across something like this? I can imagine it would be extremely useful given Mongo's performance degradation once it hits the memory wall and documents start getting paged out to disk. If I have a functioning database and want to add another index, the only way I'll know if it will be too big is to actually add it.
It wouldn't need to be accurate down to the bit, but with some assumptions about B-Trees and the index implementation I'm sure it could be reasonable enough to be helpful.
If this doesn't exist already I'd like to build and open source it, so if I've missed any required parameters for this calculation please include in your answer.
I just spoke with some of the 10gen engineers, and there isn't a tool, but you can do a back-of-the-envelope calculation based on this formula:
2 * [ n * ( 18 bytes overhead + avg size of indexed field + 5 or so bytes of conversion fudge factor ) ]
Where n is the number of documents you have.
The overhead and conversion padding are mongo specific but the 2x comes from the b-tree data structure being roughly half full (but having allocated 100% of the space a full tree would require) in the worst case.
I'd explain more but I'm learning about it myself at the moment. This presentation will have more details: http://www.10gen.com/presentations/mongosp-2011/mongodb-internals
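As a quick illustration of that back-of-the-envelope formula in the shell (the constants are the ones quoted above; the example numbers are made up):
// Rough worst-case index size estimate: 2 * n * (overhead + avg field size + fudge).
function estimateIndexSizeBytes(numDocs, avgIndexedFieldBytes) {
    var overhead = 18;  // per-entry overhead quoted above
    var fudge = 5;      // conversion fudge factor quoted above
    return 2 * numDocs * (overhead + avgIndexedFieldBytes + fudge);
}
// e.g. 10 million documents with a 32-byte average indexed field:
estimateIndexSizeBytes(10 * 1000 * 1000, 32) / (1024 * 1024)  // roughly 1 GB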
You can check the sizes of the indexes on a collection by using the command:
db.collection.stats()
More details here: http://docs.mongodb.org/manual/reference/method/db.collection.stats/#db.collection.stats
Another way to calculate is to ingest ~1000 or so documents into every collection; in other words, build a small-scale model of what you're going to end up with in production, create the indexes (or whatever you need), and calculate the final numbers based on the db.collection.stats() averages.
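A rough shell sketch of that scale-model approach (the collection name, field names, and document shape are all made up; insertMany needs a 3.2+ shell):
var docs = [];
for (var i = 0; i < 1000; i++) {
    docs.push({ sku: "SKU-" + i, color: ["Blue", "Red", "Green"][i % 3] });
}
db.scale_model.insertMany(docs);
db.scale_model.createIndex({ color: 1 });
// Extrapolate from the per-document average; this is only a rough approximation.
var perDoc = db.scale_model.stats().totalIndexSize / db.scale_model.stats().count;
print("estimated index bytes at 500M docs: " + perDoc * 500 * 1000 * 1000);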
Edit (from a comment):
Tyler's answer
describes the original MMAP storage engine circa MongoDB 2.0, but this
formula definitely isn't applicable to modern versions of MongoDB.
WiredTiger, the default storage engine in MongoDB 3.2+, uses index
prefix compression so index sizes will vary based on the distribution
of key values. There are also a variety of index types and options
which might affect sizing. The best approach for a reasonable estimate
would be using empirical estimation with representative test data for
your projected growth.
The best option is to test in a non-prod deployment!
Insert 1,000 documents and check the index sizes, insert 100,000 documents and check the index sizes, and so on.
An easy way to check the total index size of all collections in a loop:
var y = 0;
db.adminCommand("listDatabases").databases.forEach(function (d) {
    var mdb = db.getSiblingDB(d.name);
    mdb.getCollectionNames().forEach(function (c) {
        var s = mdb[c].stats(1024 * 1024).totalIndexSize;
        y = y + s;
        print("db.Collection:" + d.name + "." + c + " totalIndexSize: " + s + " MB");
    });
});
print("============================");
print("Instance totalIndexSize: " + y + " MB");