How to split sphinx index without reindexing? - sphinx

I have one real-time index. I use attributes and full-text fields. I need to split the index into several indexes for load balancing. I could reindex the original data, but retrieving the original data from storage is quite expensive. Is it possible to split the index without reindexing?

AFAIK no, this is not possible (at least there are no provided utilities; you would have to write your own).

Related

How, When and Where Should MongoDB Index Types be Used?

Can anyone tell me when it is important to use a MongoDB index and where it can be used? Also, I need the advantages and disadvantages of using MongoDB indexes.
Can anyone tell me when it is important to use a MongoDB index and where it can be used?
Indexes provide efficient access to your data.
Without indexes in place for your queries, a query may scan many more documents than it is expected to return. Good indexes avoid scanning the whole collection and more documents than are required to produce the result.
A well-designed set of indexes that caters to the incoming queries can significantly improve the performance of your database.
Also, I need the disadvantages of using MongoDB indexes
Indexes need memory and disk space. If the indexes are part of your working set, they will be kept in memory, which means you may need enough memory to hold the indexes alongside your frequently accessed data.
Every write, update, and delete operation must also update the index data structures. With many indexes on a collection, any operation that touches an indexed key has to update each of those indexes, which adds a penalty to write operations.
A large number of compound indexes also takes more time to restore on large datasets.
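For context, here is a minimal pymongo sketch (the database, collection, and field names are hypothetical) showing how an index is created and how explain() reveals whether a query actually uses it:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders  # hypothetical database/collection

# Index the fields the query filters and sorts on.
orders.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])

# With the index in place, this query no longer scans the whole collection.
recent = list(orders.find({"customer_id": 42}).sort("created_at", -1).limit(10))

# The winning plan should show an IXSCAN stage instead of a COLLSCAN.
plan = orders.find({"customer_id": 42}).explain()["queryPlanner"]["winningPlan"]
print(plan)
```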

are hashed indexes in mongodb field-size limited?

In our DB we have a large text field which we want to filter on an exists/does-not-exist basis, so we don't need to perform any text search on it.
We assumed an ordinary index would help, but it's not guaranteed the field won't exceed 1024 bytes, so that's not an option.
Does a hashed index on such a field support $exists-filtering queries?
Do hashed indexes impose any field-size limitations? (In our experiments a hashed index is well capable of indexing fields where an ordinary index fails.) We haven't found any explicit statement on this in the docs, though.
Is the chosen approach as a whole the correct one?
Yes, your approach is the correct one given the constraints. However, there are some caveats.
The performance advantage of an index over a collection scan is limited by the RAM available, since mongod tries to keep indexes in RAM. If it can't (due to other queries, for example), even an index will be read from disk, more or less eliminating its performance advantage. So you should test whether the additional index pushes the RAM needed beyond the limits of your planned deployment.
The other, more severe problem is that you cannot use said index to reliably distinguish unique documents, since there is no uniqueness guarantee for hashes. Albeit a bit theoretical, you have to keep that in mind.
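A minimal pymongo sketch of the approach under discussion (collection and field names are hypothetical); running explain() afterwards lets you verify which index the $exists filter actually uses:

```python
from pymongo import MongoClient, HASHED

client = MongoClient("mongodb://localhost:27017")
docs = client.mydb.docs  # hypothetical collection

# Hash the large text field instead of indexing its raw value,
# so index entries stay small regardless of the field's size.
docs.create_index([("big_text", HASHED)])

# Filter purely on presence/absence of the field.
present = docs.count_documents({"big_text": {"$exists": True}})
missing = docs.count_documents({"big_text": {"$exists": False}})

# Check the query plan: the winning plan should reference the hashed index.
plan = docs.find({"big_text": {"$exists": True}}).explain()["queryPlanner"]["winningPlan"]
print(present, missing, plan)
```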

MongoDB multiple compound indexing will affect performance?

Is creating multiple compound indexes to serve various types of queries better?
or
Is it better to
use a single compound index in a way that supports multiple queries (which is hard to analyze and construct, since there are a large number of queries)?
My basic question is: will creating multiple compound indexes slow down read/write operations?
Please suggest a solution.
There isn't any answer that fits all cases, but in general adding the right indexes will give you better performance: you will need fewer reads when accessing data. Maintaining the indexes will cost you some performance on writes, but if they are correct and actually used, your db will perform better overall. Start with monitoring: mongodb monitoring docs
Indices will slow down writes but speed up reads. A high read-to-write ratio warrants one or more indices on commonly fetched fields (keys). For example, our current system sees 25 writes to 20,000 reads (tps), so indices are beneficial to counter the wide margin. That being said, be mindful of holding the mongo write lock for as short a time as possible.
MongoDB uses a readers-writer lock that allows concurrent read access to a database but gives exclusive access to a single write operation. (mongodb docs)
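To make the trade-off concrete, here is a small pymongo sketch (hypothetical collection and fields) that creates one compound index per common query shape and uses explain() to confirm which index each query picks; remember that every extra index must also be updated on each write:

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")
events = client.app.events  # hypothetical collection

# One compound index per common query shape.
events.create_index([("user_id", ASCENDING), ("created_at", DESCENDING)])
events.create_index([("type", ASCENDING), ("created_at", DESCENDING)])

# Each query below is served by one of the indexes above.
q1 = events.find({"user_id": 7}).sort("created_at", -1)
q2 = events.find({"type": "login"}).sort("created_at", -1)

for cursor in (q1, q2):
    plan = cursor.explain()["queryPlanner"]["winningPlan"]
    print(plan)  # look for IXSCAN rather than COLLSCAN
```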

Can Lucene store more than 100 GB of original documents in the index?

I'm writing an application that will manipulate more than 100 GB of text documents. The size of each document is 2 KB-100 KB.
At first I planned to use a DBMS such as MySQL or Firebird to store the raw documents, keeping the search index in Lucene. This approach has some disadvantages. For example, database transactions know nothing about the Lucene index and vice versa, so I would need to synchronize them.
Then I realized that Lucene can store entire documents in the index. I would need to create index backups regularly, but that is easy: I can copy the entire directory containing the index. That way I use Lucene as a kind of NoSQL storage and may not need a DBMS at all.
What is the best practice: to store the original documents in the index or not? I really don't want to use a DBMS for this purpose. Is it possible?
You would not want to store the raw documents in a Lucene index, especially at the size you are talking about. I have done this a couple of ways, but in both cases ONLY the indexed fields are stored in the Lucene index, along with an ID/pointer to the raw document. I have dealt with indexes of well over 100 million records and they work fine on a single server.
The reason this is important is that the build time of the index drops dramatically and the index becomes far easier to manage if you don't need to store an additional 100 GB of data.
Basically, you index all the fields you need to satisfy search queries. If a user clicks on an item in a grid, I assume you want to show the raw text (the usual UI pattern is that you access many of the Lucene fields most of the time, but RARELY need to pull down the full text file).
The raw access I have used in conjunction with Lucene is:
SQL Server FILESTREAM, which is optimized for large binary file storage. It is really fast too. Not sure if MySQL has this (never worked with it)
Azure Table Storage, which is a key-value NoSQL cloud database. That was used to store the binary blobs.
It really doesn't matter what the persistent storage is, as long as it is optimized for larger binary files that can be accessed/streamed quickly by key. You could use an in-memory cache like Redis too, as long as Lucene has the ID pointer to access the binary text file.
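Here is a minimal sketch of that pattern in Python, using Whoosh (a pure-Python search library) as a stand-in for Lucene: the index stores only the analyzed text plus a doc_id, while the raw document lives in external storage (blob_store below is a hypothetical placeholder for FILESTREAM, Azure Table Storage, Redis, etc.):

```python
import os
from whoosh.fields import Schema, ID, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Index only what is needed to answer searches: the analyzed body text
# (not stored) plus a stored ID pointing at the raw document.
schema = Schema(doc_id=ID(stored=True, unique=True), body=TEXT(stored=False))
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

# Stand-in for the external blob storage keyed by document ID.
blob_store = {"42": "full raw text of the original document ..."}

writer = ix.writer()
writer.add_document(doc_id="42", body=blob_store["42"])
writer.commit()

# Hits carry only the pointer; the raw text is fetched from the blob store on demand.
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("original")
    for hit in searcher.search(query, limit=10):
        raw = blob_store[hit["doc_id"]]
        print(hit["doc_id"], raw[:40])
```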

Best practices for combining Lucene.NET and a relational database?

I'm working on a project where I will have a LOT of data, and it will be searchable through several forms that are very efficiently expressed as SQL queries, but it also needs to be searched via natural language processing.
My plan is to build an index using Lucene for this form of search.
My question is: if I do this and perform a search, Lucene will return the IDs of the matching documents in the index, and I then have to look up those entities in the relational database.
This could be done in two ways (That I can think of so far):
N separate queries (horrible)
Pass all the IDs to a stored procedure at once (perhaps as a comma-delimited parameter). This has the downside of being limited by the maximum parameter size, and the slow performance of a UDF splitting the string into a temporary table. (A parameterized alternative is sketched below.)
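For illustration, many drivers also allow a single parameterized IN query built from the returned IDs; here is a minimal sketch with Python's sqlite3 module (the table and data are hypothetical, and most drivers cap the number of parameters, so very long ID lists may still need batching):

```python
import sqlite3

# IDs returned by the search index (hypothetical values).
hit_ids = [17, 42, 308]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(17, "first"), (42, "second"), (308, "third")])

# One round trip: build a placeholder list matching the number of IDs.
placeholders = ",".join("?" for _ in hit_ids)
rows = conn.execute(
    f"SELECT id, title FROM documents WHERE id IN ({placeholders})",
    hit_ids,
).fetchall()
print(rows)
```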
I'm almost tempted to mirror everything into Lucene's index, so that I can periodically regenerate the index from the backing store but only need to access the index for the frontend.
Advice?
I would store the 'frontend' data inside the index itself, avoiding any db interaction. The db would be queried only when you want more information on the specific record.
When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built in ft support, with stemming and thesaurus support). This way the database can query using both SQL and ft commands. The downside is that you need a DB that has full-text-search capabilities, and these capabilities might be inferior to what lucene can do.
I guess the answer depends on what you are going to do with the results. If you are going to display them in a grid and let the user choose the exact document to access, then you may want to add enough text to the index to help the user identify the document, like a blurb of, say, 200 characters, and then, once the user selects a document, hit the DB to retrieve the whole thing.
This will certainly impact the size of your index, so that is another consideration to keep in mind. I would also put a cache between the DB and the front end so that the most frequently used items do not incur the full cost of a DB access every time.
Probably not an option depending on how much data is in your database, but what I have done is store the db IDs in the search index along with the properties I wanted indexed. Then in my service classes I cache all the data needed to display search results for all the objects (e.g., name, db ID, image URLs, description blurbs, social media info). The service class returns a Dictionary that can look up objects by db ID, and I use the IDs returned by Lucene.NET to pull data from the in-memory cache.
You could also forego the in-memory cache and store all the necessary properties for displaying a search result in the search index. I didn't do this because the in-memory cache is also used in scenarios other than search.
The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).
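A minimal Python sketch of that in-memory cache pattern (the class, fields, and data are hypothetical): the service layer keeps a dict keyed by db ID, and IDs coming back from the search index are resolved against it without touching the database:

```python
from dataclasses import dataclass

@dataclass
class SearchResultItem:
    db_id: int
    name: str
    image_url: str
    blurb: str

# Refreshed periodically (e.g. every few hours) from the database.
_cache: dict[int, SearchResultItem] = {
    17: SearchResultItem(17, "Widget", "https://example.com/w.png", "A widget ..."),
    42: SearchResultItem(42, "Gadget", "https://example.com/g.png", "A gadget ..."),
}

def resolve_search_hits(hit_ids: list[int]) -> list[SearchResultItem]:
    """Map IDs returned by the search index to cached display data."""
    return [_cache[i] for i in hit_ids if i in _cache]

print(resolve_search_hits([42, 99, 17]))  # 99 is not cached and is skipped
```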