Merge realtime index and disk index - sphinx

I'm new to sphinx ( and full text searching in general ).
I've read that main+delta schemes are good for when you've got a large amount of data that doesn't change over time and some new data that is added.
So i have two indexes. One main index and one RT Index. The main index is indexed once using
>indexer --merge index_main index_rt --rotate
But i get this error
FATAL: Failed to merge index index_rt to index_main: source index preload failed: failed to open C:\path\to\index\index_rt.sph
( No such file or directory )
I'm guessing this is because real-time indexes are stored differently from disk indexes.
Is there a way to merge these indexes directly?
I may not need real time index updates. If so, Is it better to use a cron to update the delta index once every day and merge them weekly or so?

Yes, I dont think merging RT indexes is supported. They are already split into many parts.
Generally its a either/or, either use RT indexes, or use disk indexes (via main+delta).
If you want to use RT indexes, just have one big index - don't split into main+delta.
(Its no harder on the application, if you updating the small delta, using sphinxQL, you can just as easy update one big index)

Related

MongoDB Internal implementation of indexing?

I've learned a lot of things about indexing and finding some stuff from
here.
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the
query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
But i still have some questions:
While Creating index using (createIndex), is the Record always stored in
RAM?
Is every time need to create Index Whenever My application
is going to restart ?
What will Happen in the case of default id (_id). Is always Store in RAM.
_id Is Default Index, That means All Records is always Store in RAM for particular collections?
Please help me If I am wrong.
Thanks.
I think, you are having an idea that indexes are stored in RAM. What if I say they are not.
First of all we need to understand what are indexes, indexes are basically a pointer to tell where on disk that document is. Just like we have indexing in book, for faster access we can see what topic is on which page number.
So when indexes are created, they also are stored in the disk, But when an application is running, based on the frequent use and even faster access they get loaded into RAM but there is a difference between loaded and created.
Also loading an index is not same as loading a collection or records into RAM. If we have index loaded we know what all documents to pick up from disk, unlike loading all document and verifying each one of them. So indexes avoid collection scan.
Creation of indexes is one time process, but each write on the document can potentially alter the indexing, so some part might need to be recalculating because records might get shuffled based on the change in data. that's why indexing makes write slow and read fast.
Again think of as a book, if you add a new topic of say 2 pages in between the book, all the indexes after that topic number needs to be recalculated. accordingly.
While Creating index Using (createIndex),Is Record always store in RAM
?.
No, records are not stored in RAM, while creating it sort of processes all the document in the collection and create an index sheet, this would be time consuming understandably if there are too many documents, that's why there is an option to create index in background.
Is every time need to create Index Whenever My application is going to
restart ?
Index are created one time, you can delete it and create again, but it won't recreated on the application or DB restart. that would be insane for huge collection in sharded environment.
What will Happen in the case of default id (_id). Is always Store in
RAM.
Again that's not true. _id comes as indexed field, so index is already created for empty collection, as when you do a write , it would recalculate the index. Since it's a unique index, the processing would be faster.
_id Is Default Index, That means All Records is always Store in RAM for particular collections ?
all records would only be stored in RAM when you are using in-memory engine of MongoDB, which I think comes as enterprise edition. Due to indexing it would not automatically load the record into RAM.
To answer the question from the title:
MongoDB indexes use a B-tree data structure.
source: https://docs.mongodb.com/manual/indexes/index.html#b-tree

MongoDB: what’s happend when the mongodb drop indexes

MongoDB create index be slow in the large data collections,it's easy to understand. But why drop indexes operation so fast? Is there any changes of the data structure after executing drop indexes operation?
Creating a new index on a collection is like create an new collection which is arranged as B-Tree so you can do search on the key fields quickly. This will look like copying part of the collection.
So for deleting index, it will like deleting a collection, mongo just remove the index collection, and then it's done.
I am not sure if you know how file system work or not but you can consider this problem in the same way.When you copy file to a disk , it will take time. But if you remove file from a disk, it takes little time because file system just mark it unused, need almost zero time.

Creating index while updating the documents

I have a collection I am updating adding a new field.
The document looks like:
{"A": "P145", "B":"adf", "C":[{"df":"14", "color":"blue"},{"df":17}],
"_id":ObjectID(....), "Synonyms":{"Synonym1": "value1",
"Synonym2": ["value1", "value2"]}}
In the update I am adding new elements to C
I want to create a index on the field A and B. A and B are 20206 unique fields. The queries to the database will be based on these fields.
The "_id" is set by default.
I plan to do it with collection.ensure_index({"A":1, "B":1}, background=True)
How much time could it need? It will be faster than the system index based on "_id"?
The amount of time it takes to add the index would depend on your hardware, but with 20206 records a simple index as you describe shouldn't take very long for most hardware.
Queries fully covered by the index (i.e. where you specify A and B, or just A, but not just B - indexes cover from left to right so unless you include A in the select, the index can't be used) will be much faster to retrieve the results. Unless you are searching by _id, the default index on _id won't help you at all; queries on A and B will have to perform a full collection scan without your proposed index, which is orders of magnitude slower than an index scan.
Inserts will be slightly slower as the index will need to be updated too, but again with a relatively small number of total documents, this isn't likely be a large overhead.
The updates to change the C collection may well be faster if you are using A and B to identify which document to update, as they will benefit from the faster search, and the update should not be impacted once the data is found as the index should not need changing.
As the absolute performance will be specific to your hardware, if you're concerned about it the best thing to do is try it out on a copy of the data (on similar hardware) and measure whether the performance meets your needs. The output from explaining the query can be very informative in understanding how your indexes are impacting your query performance.
Well, the time taken to create the index totally depends on the hardware (system) you are using and the number of records. for ~20K records it should be quick and not take more time. max few seconds in worst case. Little off topic, but i see that you have given background true option, probably its not needed as these background options are used while create a very large data set.Please consider few things while creating index, not only for this question but in general.
when you create Index in foreground they block the operation and wouldn't allow the read operation and that the reason background true is used. http://docs.mongodb.org/v2.2/administration/indexes/
the good part with foreground index creation is that the indexes are more compact and better compare to background. hence it should be preferred.
The good news is over a long run, both background index creation and foreground delivers the same performance and does'nt matter which way the indexes were created. ... Happy Mongoing.. ;-)
-$

Why and when is necessary to rebuild indexes in MongoDB?

Been working with MongoDB for a while and today I had a doubt while discussing with a colleague.
The thing is that when you create an index in MongoDB, the collection is processed and the index is built.
The index is updated within insertion and deletion of documents so I don't really see the need to run a rebuild index operation (which drops the index and then rebuild it).
According to MongoDB documentation:
Normally, MongoDB compacts indexes during routine updates. For most
users, the reIndex command is unnecessary. However, it may be worth
running if the collection size has changed significantly or if the
indexes are consuming a disproportionate amount of disk space.
Does someone has had the need of running a rebuild index operation that worth it?
As per the MongoDB documentation, there is generally no need to routinely rebuild indexes.
NOTE: Any advice on storage becomes more interesting with MongoDB 3.0+, which introduced a pluggable storage engine API. My comments below are specifically in reference to the default MMAP storage engine in MongoDB 3.0 and earlier. WiredTiger and other storage engines have different storage implementations for data & indexes.
There may be some benefit in rebuilding an index with the MMAP storage engine if:
An index is consuming a larger than expected amount of space compared to the data. Note: you need to monitor historical data & index size to have a baseline for comparison.
You want to migrate from an older index format to a newer one. If a reindex is advisible this will be mentioned in the upgrade notes. For example, MongoDB 2.0 introduced significant index performance improvements so the release notes include a suggested reindex to the v2.0 format after upgrading. Similarly, MongoDB 2.6 introduced 2dsphere (v2.0) indexes which have a different default behaviour (sparse by default). Existing indexes are not rebuilt after index version upgrades; the choice of if/when to upgrade is left to the database administrator.
You have changed the _id format for a collection to or from a monotonically increasing key (eg. ObjectID) to a random value. This is a bit esoteric, but there's an index optimisation that splits b-tree buckets 90/10 (instead of 50/50) if you are inserting _ids that are always increasing (ref: SERVER-983). If the nature of your _ids changes significantly, it may be possible to build a more efficient b-tree with a re-index.
For more information on general B-tree behaviour, see: Wikipedia: B-tree
Visualising index usage
If you're really curious to dig into the index internals a bit more, there are some experimental commands/tools you can try. I expect these are limited to MongoDB 2.4 & 2.6 only:
indexStats command
storage-viz tool
While I don't know the exact technical reasons why, in MongoDB, I can make some assumptions about this, based on what I know about indexing from other systems and based on the documentation that you quoted.
The General Idea Of An Index
When moving from one document to the next, in the full document collection, there is a lot of wasted time and effort skipping past all the data that doesn't need to be dealt with. If you're looking for document with id "1234", having to move through 100K+ of each document makes it slow
Rather than having to search through all of the content of each document in the collection (physically moving the disk read heads, etc), an index makes this fast. It's basically a key/value pair that gives you the id and the location of that document. MongoDB can quickly scan through all of the id's in the index, find the locations of the documents that it needs, and go load them directly.
Allocating File Size For An Index
Indexes take up disk space because they are basically a key/value pair stored in a much smaller location. If you have a very large collection (large number of items in the collection) then your index grows in size.
Most operating systems allocate chunks of disk space in certain block sizes. Most database also allocate disk space in large chunks, as needed.
Instead of growing 100K of file size when 100K of documents are added, MongoDB will probably grow 1MB or maybe 10MB or something - I don't know what the actual growth size is. In SQL Server, you can tell it how fast to grow, and MongoDB probably has something like that.
Growing in chunks give the ability to 'grow' the documents in to the space faster because the database doesn't need to constantly expand. If the database now has 10MB of space already allocated, it can just use that space up. It doesn't have to keep expanding the file for each document. It just has to write the data to the file.
This is probably true of collections and indexes for collections - anything that is stored on disk.
File Size And Index Re-Building
When a large collection has a lot of documents added and removed, the index becomes fragmented. index keys may not be in order because there was room in the middle of the index file and not at the end, when the index needed to be built. Index keys may have a lot of space in between them, as well.
If there are 10,000 items in the index, and # 10,001 needs to be inserted, it may be inserted in the middle of the index file. Now the index needs to re-build itself to put everything back in order. This involves moving a lot of data around, to make room at the end of the file and put item # 10,001 at the end.
If the index is constantly being thrashed - lots of stuff removed and added - it's probably faster to just grow the index file size and always put stuff at the end. this is fast to create the index, but leaves empty holes in the file where old things were deleted.
If the index file has empty space where deleted things used to be, this is wasted effort when reading the index. The index file has more movement than needed, to get to the next item in the index. So, the index repairs itself... which can be time consuming for very large collections or very large changes to a collection.
Rebuild For A Large Index File
It can take a lot of disk access and I/O operations to correctly compact the index file back down to a reasonable size, with everything in order. Move out of place items to temp location, free up space in right spot, move them back. Oh by the way, to free up space, you had to move other items to temp location. It's recursive and heavy-handed.
Therefore, if you have a very large number of items in a collection and that collection has items added and removed on a regular basis, the index may need to be rebuilt from scratch. Doing this would wipe the current index file and rebuild from the ground up - which is probably going to be faster than trying to do thousands of moves inside of the existing file. Rather than moving things around, it just writes them sequentially, from scratch.
Large Change In Collection Size
Giving everything I'm assuming above, a large change in the collection size would cause this kind of thrashing. If you have 10,000 documents in the collection and you delete 8,000 of them... well, now you have empty space in your index file where the 8,000 items used to be. MongoDB needs to move the remaining 2,000 items around in the physical file, to rebuild it in a compact form.
Instead of waiting around for 8,000 empty spaces to be cleaned up, it might be faster to rebuild from the ground up with the remaining 2,000 items.
Conclusion? Maybe?
So, the documentation that you quoted is probably going to deal with "big data" needs or high thrashing collections and indexes.
Also keep in mind that I'm making an educated guess based on what I know about indexing, disk allocation, file fragmentation, etc.
My guess is that "most users" in the documentation, means 99.9% or more of mongodb collections don't need to worry about this.
MongoDB specific case
According to MongoDB documentation:
The remove() method does not remove the indexes
So if you delete documents from a collection you are wasting disk space unless you rebuild the index for that collection.

When create my index MongoDB

that might seems a stupid question, but when should I create an index on my collection ?
To be more explicit, I was wondering if I just have to create it once, when I create my collection, and then it will be updated automatically when I add some new documents. Or do I have to regenerate it regularly in background ?
The index will be kept up-to-date by MongoDB as you update/insert documents.
Performance-wise, do not create an index until you need it (to speed up queries). And when doing massive bulk-inserts, it may be more efficient to drop the index and recreate it after you are done inserting.
MongoDB will maintain any and all indexes itself, in other words only once.
This does, however, mean you need to be careful about just what indexes you ensure as each index will create significant overhead while performing write operations. The more indexes you have the more MongoDB will have to update to do a single write.