I have a large collection of Greek tweets stored in MongoDB
(3M tweets, around 30 GB of storage).
I have created a text index on the text field and an ordered index on the timestamp field. However, I found that MongoDB does not support Greek for text indexing, so text queries in Greek are relatively slow. How can I deal with this and create an inverted index for the Greek documents as well?
Use Solr to build your index rather than MongoDB; it has a lot of features to support multilingual search.
I have just found that, according to the documentation, if I select "none" as the language, a simple inverted index based on tokenization is created.
http://docs.mongodb.org/manual/reference/text-search-languages/#text-search-languages
If you specify a language value of "none", then the text search uses
simple tokenization with no list of stop words and no stemming
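To make the quoted behaviour concrete, here is a minimal Python sketch of an inverted index built with simple tokenization only (no stop words, no stemming). It is an illustration of the idea, not MongoDB's actual tokenizer; the names tokenize and build_inverted_index and the sample tweets are assumptions made up for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Simple tokenization: lowercase, then split on non-word characters.
    # No stop-word list and no stemming, so every token is kept as-is.
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

# Hypothetical sample data.
tweets = {
    1: "Καλημέρα Ελλάδα",
    2: "Καλημέρα σε όλους",
    3: "Good morning Greece",
}
index = build_inverted_index(tweets)
```

Because nothing language-specific happens here, Greek and English tokens are handled the same way; the trade-off is that inflected forms of the same word end up as separate index entries.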
Related
Currently, I have a MongoDB instance that contains a collection with a lot of entities. Each entity has a string attribute that holds some text. My goal is to provide a strict text search over the collection. It should work like the following MySQL query:
SELECT *
FROM texts
WHERE text LIKE '%test%';
A MongoDB text index would be great, but it doesn't provide a strict search. How could I organize a strict search over such data? Is there any optimization I could apply?
I have already looked at other software (such as Elasticsearch, Lucene, MongoDB, ClickHouse), but I haven't found an option that does this. As it stands, searching takes too much time.
In MongoDB you can do it as follows:
db.texts.find({ text:/test/ })
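For reference, the semantics of that unanchored regex (and of LIKE '%test%') can be sketched in plain Python. This only illustrates the matching rule and does not talk to MongoDB; like_contains and the sample strings are hypothetical names chosen for this example.

```python
import re

def like_contains(text, needle):
    # Roughly what text LIKE '%needle%' asks for: needle anywhere in text.
    # re.escape keeps characters such as "." or "*" literal.
    return re.search(re.escape(needle), text) is not None

# Hypothetical sample data.
texts = ["this is a test", "contest results", "nothing here", "TEST in caps"]
matches = [t for t in texts if like_contains(t, "test")]
```

Two things are worth keeping in mind. The match is on substrings, so "contest" qualifies, and it is case-sensitive, whereas MySQL's LIKE is case-insensitive under its common default collations. Also, as far as I know, MongoDB can only narrow an index scan for regexes anchored at the start of the string (such as /^test/), so an unanchored pattern like /test/ still has to examine every value, which matches the slowness described in the question.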
I want some sort of limited, indexed full-text search. With FTS, Postgres indexes all the words in the text, but I want it to track only a given set of words. For example, I have a database of tweets, and I want them indexed only by special words that I supply: awesome, terrible, etc.
In case anyone is interested in something this specific, I solved it by creating a custom dictionary (thanks, Mark).
I documented my findings here: https://hackmd.io/#z889TbyuRlm0vFIqFl_AYQ/r1gKJQBZS
I have a word, "CenturyLink", which I want to force MongoDB to stem to "Centuri", so that when I search for "Century", which stems to "Centuri", all the documents containing "CenturyLink" are also returned.
So basically I want to add a custom set of key-value pairs, each a word and its stem, so that FTS indexing is done on the basis of the provided pairs.
Is this possible?
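The lookup being asked for can be sketched in Python as an override table consulted before a regular stemmer. This is not a MongoDB feature or API; it only shows the logic, which would have to run in application code both when indexing and when querying. CUSTOM_STEMS, toy_stem and stem are hypothetical names, and toy_stem is a deliberately crude stand-in for a real stemmer.

```python
# Hypothetical override table: lowercase word -> forced stem.
CUSTOM_STEMS = {"centurylink": "centuri"}

def toy_stem(word):
    # Placeholder for illustration only: a trailing "y" becomes "i".
    # A real setup would call a proper stemmer here instead.
    return word[:-1] + "i" if word.endswith("y") else word

def stem(word):
    # Custom entries win; everything else falls through to the stemmer.
    w = word.lower()
    if w in CUSTOM_STEMS:
        return CUSTOM_STEMS[w]
    return toy_stem(w)
```

With this, stem("CenturyLink") and stem("Century") both give "centuri", so a search on one would match documents indexed under the other.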
Read Query
In Postgres, full-text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
Parsing documents into tokens.
Converting tokens into lexemes.
Storing preprocessed documents optimized for searching.
The tsvector type is used in Postgres for full-text search.
The tsvector type differs from the text type in the following respects:
Eliminates case: upper- and lower-case letters are treated as identical.
Removes stop words (and, or, not, she, him, and hundreds of others) because these words are not relevant for text search.
Replaces synonyms and takes word stems (elephant -> eleph). The full-text catalogue does not contain the word elephant, only the stem eleph.
Can (and should) be indexed with GiST or GIN.
Supports custom ranking with weights and ts_rank.
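The preprocessing steps listed above can be sketched in a few lines of Python. This is a toy model of the idea behind a tsvector, not Postgres's implementation: STOP_WORDS is a tiny illustrative list, toy_stem is a crude suffix stripper standing in for a real stemmer, and to_lexemes is a hypothetical name.

```python
import re

# Tiny illustrative stop-word list; real configurations have many more.
STOP_WORDS = {"a", "and", "or", "not", "she", "him", "the"}

def toy_stem(token):
    # Crude suffix stripper used only for illustration.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def to_lexemes(text):
    # Returns {lexeme: [positions]}. Positions are 1-based and stop words
    # still consume a position even though they are not stored.
    result = {}
    tokens = re.findall(r"\w+", text.lower())
    for pos, token in enumerate(tokens, start=1):
        if token in STOP_WORDS:
            continue
        result.setdefault(toy_stem(token), []).append(pos)
    return result
```

For example, to_lexemes("The cats and the Cat") yields a single lexeme, cat, at positions 2 and 5: case is folded, the stop words are dropped, and both surface forms collapse to one stem.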
What advantage does Elasticsearch (a search engine) have over full-text search in Postgres?
Postgres full-text search and Elasticsearch are both built on the same basic technology, inverted indices, so performance is going to be about the same.
FTS is going to be easier to deploy.
ES comes with Lucene;
if you want Lucene with FTS, that will require extra effort.
We're using Postgres and its full-text feature to search for documents (post content) in our system, and it works really well.
For autocomplete, we want to build an index (dictionary?) of all the words used in the documents and search it, returning the most frequent ones first.
We will always search for a single word. We will never search for a phrase.
So if I write:
"th"
I will receive (suppose the most frequent words in our documents):
"this"
"there"
"thoughts"
...
How can we do this with Postgres? Or do we need a more advanced solution such as Apache Lucene / Solr?
Neither Postgres full-text search (which provides lexemes) nor Postgres trigrams seems suitable for this. Or maybe I am wrong?
I don't want to parse the text manually and filter out all the English stop words myself, which would be error-prone. Postgres does a good job of this while building its lexeme index. But instead of lexemes, we need to build and search a dictionary of words without normalization.
Thank you for your assistance
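To pin down the behaviour being asked for, here is a minimal Python sketch of a frequency-ranked word dictionary with prefix lookup. It is not a Postgres solution and it does not filter stop words; build_word_counts, autocomplete and the sample posts are hypothetical names and data used only for illustration.

```python
import re
from collections import Counter

def build_word_counts(docs):
    # Count every word as written (lowercased, but not stemmed).
    counts = Counter()
    for text in docs:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

def autocomplete(counts, prefix, limit=3):
    # Words starting with the prefix, most frequent first,
    # ties broken alphabetically.
    prefix = prefix.lower()
    candidates = [(w, n) for w, n in counts.items() if w.startswith(prefix)]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [w for w, _ in candidates[:limit]]

# Hypothetical sample data.
posts = [
    "This is it and this is there",
    "There are thoughts, and this is one",
]
counts = build_word_counts(posts)
suggestions = autocomplete(counts, "th")
```

With the sample posts, typing "th" suggests this, there and thoughts in that order. Inside Postgres, the ts_stat function applied to tsvectors built with the 'simple' configuration (which does not stem) may be worth checking as a way to obtain similar per-word counts, though I have not verified that it covers the stop-word requirement.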