MongoDB: [Key Too Large To Index]

Some Background: I'm planning to use MongoDB as the publishing front-end DB for a few of my websites. The actual data will be kept in a SQL Server DB, and background jobs will populate the MongoDB at predefined time intervals, for read-only purposes, to boost website performance.
The Situation: I have a table 'x' that I translated into a Mongo collection, and everything worked fine.
'x' has a column 'c' that was originally an NVARCHAR(MAX) in the source DB and holds multilingual text.
When I searched by column 'c', Mongo was doing a full scan of the collection.
So I tried ensureIndex({ c: 1 }), which worked, but when I checked the MongoDB logs it showed that 90% of the data could not be indexed: [Key Too Large To Index]!
And thus it has indexed 10% of the data and now only returns results from that 10%!
What are my alternatives?
Note: I was using this column for full text searching in SQL Server; now I'm not sure if I should go ahead with Mongo or not :(

Try running your mongod process with this parameter:
sudo mongod --setParameter failIndexKeyTooLong=false
And then try again.
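The same parameter can also be set at runtime (a sketch, assuming a MongoDB version that still enforces the 1024-byte index key limit and supports this parameter; newer releases lifted the limit and removed it):
db.adminCommand({ setParameter: 1, failIndexKeyTooLong: false })   // no restart needed
Either way, be aware that this only suppresses the error: documents with oversized keys are still left out of the index, so index-backed queries will keep missing them.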

If you need to search text inside a large string, you can use one of these:
keyword splitting
regular expressions
The former has the downside that you need some "logic" to combine the keywords into a search; the latter has a heavy performance impact.
If you really need full text search, the best option is probably an external indexer like Solr or Lucene.
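For the regular-expression option, a minimal sketch against the question's collection 'x' and column 'c' (the search term is a placeholder):
db.x.find({ c: { $regex: "some term" } })   // matches substrings, but scans every document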

Since you can do some processing when you populate the collection, you could extract some keywords and put them in a field:
_keywords : [ "mongodb" , "full search" , "nosql" ]
and make an index on that.
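A minimal sketch of this approach (someId and the keyword list are placeholders; the extraction logic would live in your import job):
db.x.update(
    { _id: someId },   // someId: placeholder for a real document _id
    { $set: { _keywords: ["mongodb", "full search", "nosql"] } }
)
db.x.ensureIndex({ _keywords: 1 })                         // multikey index over the array
db.x.find({ _keywords: { $all: ["mongodb", "nosql"] } })   // matches documents containing all keywords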

Don't use Mongo for full text searching
It's not designed for that, so yes, you will get a "key too large" error when indexing long string values.
The better approach would be to use a full text search server (Solr/Lucene or Sphinx) if your main concern is search.

Recent (2.4 and above) MongoDB builds provide a couple of other options:
As the OP's stated desire is full text search, the right approach is to use a text index, which directly supports that use case.
For an exact-match index on long string values, you can use a hashed index.
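Sketches of both, using the original question's collection 'x' and column 'c' (the createIndex/$text syntax assumes MongoDB 2.6+; on 2.4 you would use ensureIndex and the text command instead):
db.x.createIndex({ c: "text" })                   // language-aware full text index
db.x.find({ $text: { $search: "some words" } })   // stemmed, tokenized search
db.x.createIndex({ c: "hashed" })                 // keys are fixed-size hashes, so long values are fine
db.x.find({ c: "exact string value" })            // equality match can use the hashed index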

Related

MongoDB: searching for a substring in a string field

Currently, I have a MongoDB instance which contains a collection with a lot of entities. Each entity contains a string attribute, which represents some text. My goal is to provide a strict text search in the collection. It should work like this MySQL query:
SELECT *
FROM texts
WHERE text LIKE '%test%';
A MongoDB text index would be great, but it doesn't provide a strict search. How could I organize a strict search for such data? Could I do some optimization?
I have already checked other software (such as ElasticSearch, Lucene, MongoDB, ClickHouse), but I haven't found a way to do it. Searching as it is now takes too much time.
In MongoDB you can do it as follows:
db.texts.find({ text:/test/ })
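Worth noting: an unanchored regex like this cannot use an index as a range scan, so it reads every document; a case-insensitive variant behaves the same way:
db.texts.find({ text: /test/i })   // case-insensitive substring match, still a full scan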

Autocomplete and text search memory issues in apostrophe-cms: need ideas

I'm having trouble using the text search and the autocomplete because I have a piece type with 87k+ documents, some of them big (~3.4 MB of text).
I already:
Removed every field from the text index except title, searchBoost and seoDescription; these are the only fields copied to highSearchText, and the field lowSearchText is always set to an empty string.
Modified the standard text index, putting the fields type, published and trash at the beginning of it. I also modified the queries to have equality conditions on these fields. The result returned by the command db.aposDocs.stats() shows:
type_1_published_1_trash_1_highSearchText_text_lowSearchText_text_title_text_searchBoost_text: 12201984 (~11 MB, fits nicely in memory)
Verified that this index is being used, both in the 'toDistinct' query and in the final 'toArray' query.
What I think is the biggest problem
The documents have many repeated words in the title, so if the user types a word present in 5k document titles, the server suffers.
Idea I'm testing
The MongoDB docs say that to improve performance the entire collection must fit in RAM (https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs, last bullet).
So I created a separate collection named "search" with just the fields highSearchText (string, indexed as text) and highSearchWords (array, also indexed), which results in a total size of ~19 MB.
By doing the same operations as the standard Apostrophe autocomplete on this collection, I achieved much faster, but similar, results.
I had to write events to automatically update the search collection when a piece changes, but it seems to work so far.
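For reference, a minimal sketch of the indexes on that separate collection (field names as described above; the exact definitions are my assumption of the setup):
db.search.createIndex({ highSearchText: "text" })   // text index for the search queries
db.search.createIndex({ highSearchWords: 1 })       // multikey index for autocomplete lookups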
Issues
I'm testing this search collection with the autocomplete. For the simple text search, I'm just limiting the sorted response to 50 results. Maybe I'll have to use the search collection there as well, because the search could still break.
Is there some easier approach I'm missing? Please, any ideas are welcome.

Autocomplete by most frequent words - postgres or lucene?

We're using Postgres and its full text feature to search for documents (post content) in our system, and it works really well.
For autocomplete we want to build an index (dictionary?) of all words used in the documents and search by the most frequent ones.
We will always search for one word, never for a phrase.
So if I write:
"th"
I will receive (assuming these are the most frequent words in our documents):
"this"
"there"
"thoughts"
...
How can this be done with Postgres? Or do we need a more advanced solution like Apache Lucene / Solr?
Neither Postgres full text search (which provides lexemes) nor Postgres trigrams seems suitable for this job. Or am I wrong?
I don't want to manually parse the text and ignore all English stopwords, which would be error-prone. Postgres does a good job of this while building the lexeme index. But instead of lexemes, we need to build and search a dictionary of words without normalization.
Thank you for your assistance

MongoDB indexing: 'key too large to index'

I have a dataset whose schema is like this:
{..., "url":"www.google.com", "time::143703672, "geo":"US-NJ", ...}
I want to search documents by the following (I'm using pymongo):
data.find({'url':url, 'geo':user_geo, 'time':{"$gt": userlog_gmt_time - 10, "$lt": userlog_gmt_time + 10}})
I tried to build a compound index:
db.AdxRevenueData_Jul.createIndex({url:1, geo:1, time:1})
However, I receive the error "key too long to index". The reason is that some URLs are extremely long. I believe there are some "wrong" URLs in the dataset.
Then I tried to build a compound index based on only 'geo' and 'time'. I thought I could iterate over the returned results and find the documents matching the url. However, this method was too slow...
Someone suggested I set a parameter to skip those long URLs:
sudo mongod --setParameter failIndexKeyTooLong=false
But I am not sure that all long URLs are wrong. I do not want to skip a lot of documents.
My question is: is there any other solution, apart from changing the DB?
For instance, should I use a "text" index? Will it work?
db.AdxRevenueData_Jul.createIndex({url:'text', geo:1, time:1})
Or, should I additionally build url as a 'hashed' single-field index? Can that be a compromise?
db.AdxRevenueData_Jul.createIndex({url:'hashed'})
db.AdxRevenueData_Jul.createIndex({geo:1, time:1})
Using a text index will not be helpful here.
I would recommend using a hashed index for this field.
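A minimal sketch of the recommended layout (note: hashed fields could not be part of a compound index before MongoDB 4.4, hence the two separate indexes):
db.AdxRevenueData_Jul.createIndex({ url: "hashed" })     // fixed-size hashed keys, immune to long URLs
db.AdxRevenueData_Jul.createIndex({ geo: 1, time: 1 })   // supports the geo equality plus time range
A hashed index supports equality matches only, which is exactly how the original query uses url.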

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" with the word "blue" using a MongoDB full text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full text index, so that I can use them as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (e.g. "running" should match "run"). This is different from the prefix match (e.g. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search, I would suggest focusing on typeahead's prefetch support:
Create a keywords collection which has the common words (perhaps with a usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic incremental Map/Reduce as new documents are added (a sketch follows after the typeahead example below).
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
  name: 'mysearch',
  prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
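A minimal sketch of the Map/Reduce step described above (collection and field names are assumptions for illustration):
db.articles.mapReduce(
    function () {
        // emit each word of the text field, lowercased, skipping very short tokens
        this.text.toLowerCase().split(/\W+/).forEach(function (word) {
            if (word.length > 2) emit(word, 1);
        });
    },
    function (key, values) {
        return Array.sum(values);   // usage frequency count per word
    },
    { out: "keywords" }
)
Your application can then dump the most frequent _id values from the keywords collection into /data/keywords.json for the prefetch above.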
A simple workaround I am using right now is to break the text into individual characters and store them as a text-indexed array.
Then, when you run the $search query, you simply break the query up into characters again.
Please note that this only works for short strings (say, shorter than 32 characters); otherwise the index building process takes really long, and insert performance drops significantly.
You cannot query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always full words; they are stemmed. So you probably wouldn't find "blueberry" in the index, just "blueberri".
I don't know if this might be useful to people newly facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make the search with $regex, by creating the proper index. E.g.:
db.collection.find({ query: { $regex: /querywords/ } }).sort({ criteria: -1 }).limit(limit)
You would need an index as follows:
db.collection.ensureIndex({ query: 1, criteria: -1 })
This could be really fast if you have enough memory.
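One caveat: an unanchored regex such as /querywords/ still has to scan every key in the index, so the speed here comes from the index fitting in memory rather than from a range scan. Only an anchored, case-sensitive prefix regex can turn it into a range scan:
db.collection.find({ query: /^queryword/ }).sort({ criteria: -1 }).limit(limit)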
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution: go for Elasticsearch. It's a JSON-document-driven database, structurally similar to MongoDB. It has an "edge-ngram" analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches. You can also search partially.