Lucene.Net phrase count - lucene.net

How can I use Lucene.Net to count how many times a phrase occurs in a text (not a single word such as "something", but a whole phrase such as "Hi how are you")?
Sorry for my English.

One way to do it is to use TermPositionVectors.
You basically get the positions for each of your query terms, and count the number of times they occur in the same order in your Document as they were in your Query.
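To illustrate the idea without depending on a particular Lucene.Net version, here is a minimal sketch in plain JavaScript. The helpers termPositions and countPhrase are hypothetical names, and the tokenizer (lower-casing, splitting on non-alphanumerics) is only an assumption standing in for whatever analyzer your index uses. The position map plays the role of the term vector with positions.

```javascript
// Hypothetical helper: builds a term -> positions map, roughly what a
// term vector with positions gives you for one document field.
function termPositions(text) {
  const positions = new Map();
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  tokens.forEach((token, index) => {
    if (!positions.has(token)) positions.set(token, []);
    positions.get(token).push(index);
  });
  return positions;
}

// Counts phrase occurrences: a hit starts at position p of the first term
// if every following term i occurs at position p + i.
function countPhrase(text, phrase) {
  const positions = termPositions(text);
  const terms = phrase.toLowerCase().match(/[a-z0-9]+/g) || [];
  if (terms.length === 0) return 0;
  const starts = positions.get(terms[0]) || [];
  return starts.filter((start) =>
    terms.every((term, offset) =>
      (positions.get(term) || []).includes(start + offset)
    )
  ).length;
}

const sampleText = "Hi how are you? I said: hi, how are you today. How are you?";
console.log(countPhrase(sampleText, "Hi how are you")); // 2
```

The last "How are you?" is not counted because it is not preceded by "hi".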

Related

Lucene Wildcard Query - Length of matched string

I have set up a simple lucene.net index and am testing out a few queries.
I have an index with a field called "Biography" and I am running this query:
WildcardQuery query = new WildcardQuery(new Term("Biography", "*anag*"));
This returns matches for records containing the word "management", which is great.
If I search for this...
WildcardQuery query = new WildcardQuery(new Term("Biography", "*anagm*"));
then I get no results.
Here are the 2 strings I have in the index:
"im good at project management"
"im good at programming and project management. i like managing things"
Is there a character limit to wildcard searching?
My use case is a free-text search box for users, so I cannot predict what they will type, and I want to run their input as a wildcard search.
The partial word "anagm" does not occur in either of your two sentences, so returning 0 results is the expected behavior:
"im good at project management"
"im good at programming and project management. i like managing things"
Which sentence did you think would match, and why?
Lucene is more often used to match words, or more specifically tokens, from the original sentences. Wildcard matching with Lucene (as one might do with SQL) is much less common, since a leading wildcard performs poorly (just as it does in SQL).
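The point can be checked with a small sketch in plain JavaScript. It assumes the analyzer lower-cases the text and splits it on non-alphanumeric characters; wildcardToRegExp and matchingTerms are hypothetical helpers, not Lucene APIs. Each wildcard pattern is tested against individual terms, which is roughly how a wildcard query is evaluated against the terms of a field.

```javascript
// Hypothetical helper: translates a wildcard pattern ("*" = any run of
// characters, "?" = exactly one character) into an anchored RegExp.
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$");
}

// Returns the terms (single tokens) of the text that match the pattern.
function matchingTerms(text, pattern) {
  const terms = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const regex = wildcardToRegExp(pattern);
  return terms.filter((term) => regex.test(term));
}

const biography = "im good at programming and project management. i like managing things";
console.log(matchingTerms(biography, "*anag*"));   // [ 'management', 'managing' ]
console.log(matchingTerms(biography, "*anagm*"));  // []
console.log(matchingTerms(biography, "*anagem*")); // [ 'management' ]
```

"management" contains "anage", not "anagm", which is why the second pattern finds nothing.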

Can a $text search perform a partial match

I'm very confused by this behavior. It seems inconsistent and strange, especially since I've read that MongoDB isn't supposed to support partial search terms in full text search. I'm using version 3.4.7 of MongoDB Community Server. I'm running these tests from the mongo shell.
So, I have a MongoDB collection with a text index assigned. I created the index like this:
db.submissions.createIndex({"$**":"text"})
There is a document in this collection that contains these two values:
"Craig"
"Dr. Bob".
My goal is to do a text search for a document that has multiple matching terms in it.
So, here are tests I've run, and their inconsistent output:
SINGLE TERM, COMPLETE
db.submissions.find({"$text":{"$search":"\"Craig\""}})
Result: Gets me the document with this value in it.
SINGLE TERM, PARTIAL
db.submissions.find({"$text":{"$search":"\"Crai\""}})
Result: Returns nothing, because this partial search term doesn't exactly match anything in the document.
MULTIPLE TERMS, COMPLETE
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bob\""}})
Result: Returns the document with both of these terms in it.
MULTIPLE TERMS, ONE PARTIAL
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bo\""}})
Result: Returns the document with both terms in it, despite the fact that one term is partial. There is nothing in the document that matches "Dr. Bo"
MULTIPLE TERMS, BOTH PARTIAL
db.submissions.find({"$text":{"$search":"\"Crai\" \"Dr. Bo\""}})
Result: Returns the document with both terms in it, despite the fact that both terms are partial and incomplete. There is nothing in the document that matches either "Crai" or "Dr. Bo".
Question
So, it all boils down to: why? Why is it that when I do a text search with a single partial term, nothing gets returned, but when I do a text search with two partial terms, I get the matching result? It just seems so strange and inconsistent.
MongoDB $text searches do not support partial matching. MongoDB allows text search queries on string content with support for case insensitivity, delimiters, stop words and stemming. And the terms in your search string are, by default, OR'ed.
Taking your (very useful :) examples one by one:
SINGLE TERM, PARTIAL
// returns nothing because there is no whole word with the value `Crai` in your
// text index and there is no whole word for which `Crai` is a recognised stem
db.submissions.find({"$text":{"$search":"\"Crai\""}})
MULTIPLE TERMS, COMPLETE
// returns the document because it contains all of these words
// note in the text index Dr. Bob is not a single entry since "." is a delimiter
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bob\""}})
MULTIPLE TERMS, ONE PARTIAL
// returns the document because it contains the whole word "Craig" and it
// contains the whole word "Dr"
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bo\""}})
MULTIPLE TERMS, BOTH PARTIAL
// returns the document because it contains the whole word "Dr"
db.submissions.find({"$text":{"$search":"\"Crai\" \"Dr. Bo\""}})
Bear in mind that the $search string is ...
A string of terms that MongoDB parses and uses to query the text index. MongoDB performs a logical OR search of the terms unless specified as a phrase.
So, if at least one term in your $search string matches then MongoDB matches that document.
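As a rough illustration of the explanation above, here is a simplified model in plain JavaScript. It is not MongoDB's implementation: it ignores stemming, stop words and any special handling of quoted phrases, and the helpers tokenize and matchesAnyTerm are hypothetical. It only shows how whole-word matching combined with a logical OR produces the four results listed above.

```javascript
// Simplified tokenizer: lower-cases and splits on anything that is not a
// letter or digit, so "Dr. Bob" becomes ["dr", "bob"].
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// A document matches if at least one search term equals one of its whole words.
function matchesAnyTerm(documentValues, search) {
  const indexed = new Set(documentValues.flatMap(tokenize));
  return tokenize(search).some((term) => indexed.has(term));
}

const submission = ["Craig", "Dr. Bob"];
console.log(matchesAnyTerm(submission, "Crai"));          // false
console.log(matchesAnyTerm(submission, "Craig Dr. Bob")); // true
console.log(matchesAnyTerm(submission, "Craig Dr. Bo"));  // true ("craig" and "dr")
console.log(matchesAnyTerm(submission, "Crai Dr. Bo"));   // true ("dr")
```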
To verify this behaviour, if you edit your document changing Dr. Bob to DrBob then the following queries will return no documents:
db.submissions.find({"$text":{"$search":"\"Craig\" \"Dr. Bo\""}})
db.submissions.find({"$text":{"$search":"\"Crai\" \"Dr. Bo\""}})
These now return no matches: Dr is no longer a whole word in your text index, since it is no longer followed by the . delimiter.
You can do partial searching with Mongoose by using an external Mongoose library called mongoose-fuzzy-search, which breaks the search text into multiple fragments (n-grams).
User.fuzzySearch('jo').sort({ age: -1 }).exec(function (err, users) {
  console.error(err);
  console.log(users);
});

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by searching for "blue" with a MongoDB full text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full text index, so that I can use them as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
  name: 'mysearch',
  prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
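The keyword-list step could look roughly like the sketch below. It is written in plain JavaScript over an in-memory array instead of a Map/Reduce job, so it only illustrates the shape of the data. countKeywords and keywordsJson are hypothetical helpers, and the sketch assumes a plain JSON array of strings is an acceptable prefetch payload for your typeahead version.

```javascript
// Counts how often each word occurs across the documents' text.
function countKeywords(texts) {
  const counts = new Map();
  for (const text of texts) {
    for (const word of text.toLowerCase().match(/[a-z0-9]+/g) || []) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  return counts;
}

// Keeps the `limit` most frequent keywords (ties broken alphabetically)
// and serialises them as a JSON array of strings.
function keywordsJson(texts, limit) {
  const popular = [...countKeywords(texts).entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([word]) => word);
  return JSON.stringify(popular);
}

const sampleTexts = ["Blueberry muffins", "Blueberry jam and blue cheese", "Blue sky"];
console.log(keywordsJson(sampleTexts, 3)); // ["blue","blueberry","and"]
```

Your application would serve that string at the prefetch URL (here /data/keywords.json) and regenerate it whenever the keywords collection changes.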
A simple workaround I am doing right now is to break the text into individual chars stored as a text indexed array.
Then when you do the $search query you simply break up the query into chars again.
Please note that this only works for short strings, say shorter than 32 characters; otherwise building the index takes a really long time, and insert performance drops significantly.
You cannot query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words, since they are stemmed. So you probably wouldn't find "blueberry" in the index, just "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make a search by $regex, by creating the proper index. E.g:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
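One caveat worth adding: as far as I know, MongoDB can only use the index efficiently for a $regex when the pattern is a case-sensitive prefix expression (anchored with ^). Below is a small sketch of building such a filter from user input. escapeRegExp and prefixFilter are hypothetical helpers, and the field name "query" is simply taken from the example above.

```javascript
// Escapes characters that have a special meaning in a regular expression,
// so that user input is matched literally.
function escapeRegExp(input) {
  return input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a filter for the "query" field that matches values starting
// with the given prefix.
function prefixFilter(prefix) {
  return { query: { $regex: "^" + escapeRegExp(prefix) } };
}

const blueFilter = prefixFilter("blue");
console.log(JSON.stringify(blueFilter)); // {"query":{"$regex":"^blue"}}
console.log(new RegExp(blueFilter.query.$regex).test("blueberry")); // true
console.log(new RegExp(prefixFilter("a.b").query.$regex).test("axb")); // false
```

The resulting object can be passed to find() in place of the literal /querywords/ filter.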
For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. It's a JSON document-driven database, structurally similar to MongoDB. It has an "edge-ngram" analyzer, which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches. You can also search partially.
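For anyone unfamiliar with the term: edge n-grams are simply the leading prefixes of each token, and indexing them is what makes partial (prefix) search cheap. The sketch below is plain JavaScript, not Elasticsearch code. edgeNgrams and buildPrefixIndex are hypothetical helpers, and the minGram and maxGram parameters only echo the min_gram and max_gram settings.

```javascript
// Returns the prefixes of `word` whose length is between minGram and maxGram.
function edgeNgrams(word, minGram, maxGram) {
  const grams = [];
  for (let length = minGram; length <= Math.min(maxGram, word.length); length++) {
    grams.push(word.slice(0, length));
  }
  return grams;
}

// Maps every edge n-gram to the words it was generated from, so a partial
// search term becomes a plain exact-match lookup.
function buildPrefixIndex(words, minGram, maxGram) {
  const index = new Map();
  for (const word of words) {
    for (const gram of edgeNgrams(word.toLowerCase(), minGram, maxGram)) {
      if (!index.has(gram)) index.set(gram, []);
      index.get(gram).push(word);
    }
  }
  return index;
}

console.log(edgeNgrams("blueberry", 2, 5)); // [ 'bl', 'blu', 'blue', 'blueb' ]
const prefixIndex = buildPrefixIndex(["blueberry", "blue", "black"], 1, 5);
console.log(prefixIndex.get("blu")); // [ 'blueberry', 'blue' ]
console.log(prefixIndex.get("bl"));  // [ 'blueberry', 'blue', 'black' ]
```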

sphinx get non-stemmed results

When I query Sphinx, it first applies stemming to my input keywords and gives me back a set of results with the words that matched. The problem is that the result keywords are also stemmed.
Is there a way to get back from Sphinx the original search keyword and not the stemmed one? For example, if I do a batch query with the following words:
credit card
working
walked
and suppose Sphinx finds a match for credit card. The problem is that Sphinx returns "cred card", and I have to manually check (by comparing strings) which of the above keywords the document(s) matched. This could be very inefficient in my circumstances.
Any suggestions?

mongoDB group by/distinct query

model checkin:
checkin
  _id
  interest_id
  author_id
I've got a collection of checkins (retrieved by a simple "find" query).
I'd like to count the number of checkins for each interest.
What makes the task a bit more difficult: two checkins from the same person for the same interest should count as one checkin.
AFAIK, group operations in mongo are performed by map/reduce query. Should we use it here? The only idea I've got with such an approach is to aggregate the array of users for each interest and then return this array's length.
EDIT: I ended up not using map/reduce at all, although Emily's answer worked fine and quickly.
I only have to select checkins from the last 60 minutes, so there shouldn't be too many results. I just fetch all of them through the Ruby driver and do all the calculations on the Ruby side. It's a bit slower, but much more scalable and easier to understand.
best,
Roman
Map reduce would probably be the way to go for this and you could get the desired results with two map reduces.
In the first, you could remove duplicate author_id and interest_id pairs.
key would be author_id and interest_id
values would be checkin_id
The second map reduce would just be a count of the number of deduplicated checkins for a given interest_id, since the question asks for a count per interest.
key would be interest_id
value would be the checkin count
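The two-step logic can be sketched in plain JavaScript over an in-memory array. This is not a MongoDB map/reduce job, just the same dedupe-then-count idea, counting per interest as the question asks. distinctPairs and countByInterest are hypothetical helpers and the sample data is made up.

```javascript
// Step 1: drop duplicate (author_id, interest_id) pairs.
function distinctPairs(checkins) {
  const seen = new Set();
  return checkins.filter(({ author_id, interest_id }) => {
    const key = JSON.stringify([author_id, interest_id]);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Step 2: count the remaining checkins per interest_id.
function countByInterest(checkins) {
  const counts = {};
  for (const { interest_id } of distinctPairs(checkins)) {
    counts[interest_id] = (counts[interest_id] || 0) + 1;
  }
  return counts;
}

const sampleCheckins = [
  { _id: 1, interest_id: "chess", author_id: "ann" },
  { _id: 2, interest_id: "chess", author_id: "ann" }, // same person, same interest
  { _id: 3, interest_id: "chess", author_id: "bob" },
  { _id: 4, interest_id: "tennis", author_id: "ann" },
];
console.log(countByInterest(sampleCheckins)); // { chess: 2, tennis: 1 }
```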