Is it possible to give more weighting to a search term in Lucene.Net?

As per the title, I would like to give a weighting to search terms which are being sent to a Lucene.Net search.
The data is address data, so I would like to give a lower weighting to words like "street" and "avenue", and also to numbers.

It looks like Query.SetBoost is what I was looking for.
It lets you assign a higher (or lower) weight to an individual query term.
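As a rough sketch (assuming the older Lucene.Net 2.9-style API in which boost is set via SetBoost; newer versions expose a Boost property instead, and the field name here is illustrative):

    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // Build a query over an "address" field (illustrative name), boosting distinctive
    // terms and down-weighting generic ones such as "street".
    var query = new BooleanQuery();

    var name = new TermQuery(new Term("address", "acacia"));
    name.SetBoost(2.0f);                                // weight this term up
    query.Add(name, BooleanClause.Occur.SHOULD);

    var street = new TermQuery(new Term("address", "street"));
    street.SetBoost(0.2f);                              // weight this term down
    query.Add(street, BooleanClause.Occur.SHOULD);

    // Pass the query to an existing IndexSearcher as usual, e.g. searcher.Search(query, 10).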

Related

Algolia search only by distance

Is it possible to rank search results based solely on proximity (and other filters) to a specific point and ignore the custom ranking formula (likes)?
For example, I'd like to rank results closest to times square strictly by distance and for the query to not care about the likes attribute.
I assume the likes attribute is one of your own.
You can get what you need by removing the likes attribute from the custom ranking.
If you need two ways to search, you can create a replica index with the same data but a different ranking formula: https://www.algolia.com/doc/guides/ranking/sorting/?language=go#multiple-sorting-strategies-for-an-index
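Roughly, with the Algolia .NET client (a sketch assuming the v6-style API; class, index, and attribute names are illustrative and may differ in your client version), the replica drops the custom likes criterion so that geo distance decides the order:

    using System.Collections.Generic;
    using Algolia.Search.Clients;
    using Algolia.Search.Models.Settings;

    var client = new SearchClient("YourApplicationID", "YourAdminAPIKey");

    // Primary index: keeps "likes" in the custom ranking and declares a replica.
    var places = client.InitIndex("places");
    places.SetSettings(new IndexSettings
    {
        CustomRanking = new List<string> { "desc(likes)" },
        Replicas = new List<string> { "places_by_distance" }
    });

    // Replica: same data, but "custom" is removed from the ranking formula,
    // so results near the query point (e.g. Times Square) are ordered by distance.
    var byDistance = client.InitIndex("places_by_distance");
    byDistance.SetSettings(new IndexSettings
    {
        Ranking = new List<string> { "geo", "typo", "words", "filters", "proximity", "attribute", "exact" }
    });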

Matrix factorization based recommendation for like/dislike/unknown data

Most of the literature focuses on either explicit rating data or implicit (like/unknown) data. Are there any good publications on handling like/dislike/unknown data? That is, the data matrix contains three kinds of values, and I'd like to recommend items from among the unknown entries.
And are there any good open source implementations on this?
Thanks.
With like and dislike, you already have explicit rating data. You can use standard collaborative filtering with user and item normalization. You can also check out OrdRec: An Ordinal Model for Predicting Personalized Item Rating Distributions, which just takes an ordinal ranking of item ratings. That is, you can say that Like is better than Dislike, and let the algorithm figure out the best ranking-to-rating mapping before doing standard item-item collaborative filtering. Download LensKit and use the included OrdRec algorithm.
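As a minimal sketch of the first suggestion (treating like/dislike as explicit ratings with user normalization), where the +1/-1 mapping is just an illustrative choice:

    using System.Collections.Generic;
    using System.Linq;

    // raw[user][item] = true for "like", false for "dislike"; unknown pairs are simply absent.
    // Like -> +1, Dislike -> -1 (an illustrative mapping), then mean-center per user.
    // The centered matrix can then be fed to any standard collaborative filtering method
    // (item-item, matrix factorization, LensKit, ...).
    static Dictionary<int, Dictionary<int, double>> ToCenteredRatings(
        Dictionary<int, Dictionary<int, bool>> raw)
    {
        var centered = new Dictionary<int, Dictionary<int, double>>();
        foreach (var (user, items) in raw)
        {
            var numeric = items.ToDictionary(kv => kv.Key, kv => kv.Value ? 1.0 : -1.0);
            double mean = numeric.Values.Average();
            centered[user] = numeric.ToDictionary(kv => kv.Key, kv => kv.Value - mean);
        }
        return centered;
    }

    // Example: user 0 liked item 7 and disliked item 9; item 3 stays unknown.
    var raw = new Dictionary<int, Dictionary<int, bool>>
    {
        [0] = new Dictionary<int, bool> { [7] = true, [9] = false }
    };
    var ratings = ToCenteredRatings(raw);   // ratings[0][7] == +1.0, ratings[0][9] == -1.0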

Cluster text documents in database

I have 20,000 text files loaded into a PostgreSQL database, one file per row, stored in a table named docs with columns doc_id and doc_content.
I know there are approximately 8 types of documents. Here are my questions:
1. How can I find these groups?
2. Are there similarity/dissimilarity measures I can use?
3. Is there an implementation of longest common substring in PostgreSQL?
4. Are there any extensions for text mining in PostgreSQL? (I've found only Tsearch, but it seems to have been last updated in 2007.)
I could probably use something like LIKE '%%' or SIMILAR TO, but there might be a better approach.
You should use full text search, which is part of the PostgreSQL 9.x core (formerly Tsearch2).
For some kind of longest-common-substring measure (or similarity, if you will), you might be able to use the levenshtein() function, part of the fuzzystrmatch extension.
1. You can use a clustering technique such as K-Means or hierarchical clustering.
2. Yes, you can use cosine similarity between documents, computed over binary term occurrence, raw term counts, term frequencies, or TF-IDF weights (see the sketch below).
3. I don't know about that one.
4. Not sure, but you could use R or RapidMiner to do the data mining against your database.
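A minimal sketch of the cosine-similarity suggestion in point 2, assuming the doc_content strings have already been fetched from the docs table; the tokenization here is deliberately naive, and TF-IDF weighting would be a drop-in refinement:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Naive bag-of-words term counts for one document.
    static Dictionary<string, int> TermCounts(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\n', '.', ',', ';', ':', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

    // Cosine similarity between two term-count vectors (1.0 = same direction, 0.0 = disjoint).
    static double CosineSimilarity(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = a.Sum(kv => (double)kv.Value * (b.TryGetValue(kv.Key, out var c) ? c : 0));
        double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return (normA > 0 && normB > 0) ? dot / (normA * normB) : 0.0;
    }

    // doc_content values would come from the docs table; placeholders here.
    string doc1 = "invoice for office supplies";
    string doc2 = "invoice for cleaning supplies";
    double similarity = CosineSimilarity(TermCounts(doc1), TermCounts(doc2));
    // Pairwise similarities (or the term vectors themselves) can then feed K-Means
    // or hierarchical clustering to recover the ~8 document groups.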

How to generate recommendation with matrix factorization

I've read some papers on matrix factorization (latent factor models) in recommender systems, and I can implement the algorithm. I get RMSE results on the MovieLens dataset similar to those reported in the papers.
However, I find that if I try to generate a top-K (e.g. K=10) list of recommended movies for every user by ranking the predicted ratings, the movies predicted to be rated highest seem to be the same for all users.
Is that just how it works, or have I got something wrong?
This is a known problem in recommender systems.
It is sometimes called "Harry Potter" effect - (almost) everybody likes Harry Potter.
So most automated procedures will find out which items are generally popular, and recommend those to the users.
You can either filter out very popular items, or multiply the predicted rating by a factor that is lower the more globally popular an item is.
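A minimal sketch of the second option (the log damping and the alpha weight are illustrative choices, not a prescribed formula):

    using System;

    // Damp predicted ratings by global popularity before building the top-K list.
    static double AdjustedScore(double predictedRating, int timesRated, double alpha = 1.0) =>
        predictedRating / (1.0 + alpha * Math.Log(1.0 + timesRated));

    // A mildly-liked niche movie can now outrank a blockbuster that everyone rates highly.
    Console.WriteLine(AdjustedScore(4.2, 12));     // niche item   -> about 1.18
    Console.WriteLine(AdjustedScore(4.6, 25000));  // very popular -> about 0.41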

Modifying Levenshtein Distance for positional Bias

I am using the Levenshtein distance algorithm to compare a company name provided as user input against a database of known company names, to find the closest match. By itself, the algorithm works okay, but I want to build in a bias so that the edit distance is considered lower when the initial parts of the strings match.
For example, if the search term is "ABCD", then both "ABCD Co." and "XYX ABCD" have an identical edit distance. However, I want to give weight to the fact that the initial part of the first string matches the search term more closely than the second string does.
One way of doing this might be to modify the insert/delete/replace costs so that they are higher at the beginning of the strings and lower towards the end. Does anyone have an example of a successful implementation of this? Is Levenshtein distance still the best way to achieve what I'm trying to do? Is my assumed approach reasonable?
UPDATE: For my immediate purposes I have decided to forgo the above and instead use the Jaro-Winkler distance, which seems to solve the problem. However, I will leave this open for further input.
What you're looking for looks like a Smith-Waterman local alignment: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
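For reference, a minimal sketch of the position-weighted cost idea from the question (the linear decay from 2.0 to 1.0 is an illustrative choice, not a standard scheme; Smith-Waterman or Jaro-Winkler remain more principled alternatives):

    using System;

    // Edit distance in which edits near the start of the strings cost more than edits near the end.
    static double WeightedLevenshtein(string s, string t)
    {
        int n = s.Length, m = t.Length;
        // Cost of an edit at character position i: 2.0 at the start, tapering linearly to 1.0.
        double Cost(int i, int len) => len <= 1 ? 2.0 : 2.0 - (double)i / (len - 1);

        var d = new double[n + 1, m + 1];
        for (int i = 1; i <= n; i++) d[i, 0] = d[i - 1, 0] + Cost(i - 1, n);   // deletions
        for (int j = 1; j <= m; j++) d[0, j] = d[0, j - 1] + Cost(j - 1, m);   // insertions

        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
            {
                double sub = s[i - 1] == t[j - 1] ? 0.0 : Math.Max(Cost(i - 1, n), Cost(j - 1, m));
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + Cost(i - 1, n),      // delete s[i-1]
                    d[i, j - 1] + Cost(j - 1, m)),     // insert t[j-1]
                    d[i - 1, j - 1] + sub);            // substitute (free on a match)
            }
        return d[n, m];
    }

    // "ABCD Co." only needs cheap edits at the tail, while "XYX ABCD" needs expensive
    // edits at the head, so the first now scores as the closer match to "ABCD".
    Console.WriteLine(WeightedLevenshtein("ABCD", "ABCD Co."));   // about 4.9
    Console.WriteLine(WeightedLevenshtein("ABCD", "XYX ABCD"));   // about 7.1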