Does Algolia have a search with recommendation?

I was wondering if the Algolia service provides some kind of recommendations mechanism when doing a search.
I could not find anything in the API documentation related to providing the client with more refined and intelligent search alternatives based on the index data.
The scenario I am trying to describe is the following (this example is a bit over-the-top):
Given a user is searching for "red car" then the system provides more specific search alternatives and possibly related items that exist in the database (e.g. ferrari, red driving gloves, fast and furious soundtrack :-) )
Update
For future reference, I ended up doing a basic recommendation system using the text search capabilities of Algolia.
Summarising: when a car is saved, its attributes (color, speed, engine, etc.) are used to create synonym indexes. For example, for the engine "ferrari" in the Engine index:
{
  synonyms: ['red', 'ferrari', 'fast'],
  value: 'ferrari'
}
Finally, each index must declare the synonyms attribute as searchable and the value attribute as the result returned from a search.
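For reference, a minimal sketch of that configuration with the Algolia Python API client (v3-style); the index name, credentials and attribute names are placeholders from this example, not from the original setup:

# Sketch: configure a hypothetical "Engine" index so that only the
# "synonyms" attribute is searched and only "value" is returned.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YourApplicationID", "YourAdminAPIKey")
index = client.init_index("Engine")

index.set_settings({
    "searchableAttributes": ["synonyms"],   # search only against the synonym list
    "attributesToRetrieve": ["value"],      # return only the canonical value
})

# A query for "red" would now hit the record {synonyms: [...], value: 'ferrari'}
results = index.search("red")
print([hit["value"] for hit in results["hits"]])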

Algolia does not provide that kind of "intelligence" out of the box.
Something you can do to approximate what you're looking for is using synonyms and a combination of other parameters:
Define synonyms groups such as "car,Ferrari,driving gloves", "red,dark red,tangerine,orange", ...
When sending a search query, set optionalWords to the list of words contained in that query. This will make each word of your query optional.
Also set removeStopWords to true so that words like "the", "a" (...) are ignored, to improve relevance.
With a well defined list of synonyms, this will make your original query interpreted as many other possibilities and thus increase the variety of possible results.
Be aware though that it could also impact the relevance of your results, as users might for instance not want to look for gloves when they search for a car!
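This is not an official recipe, just a rough sketch of the suggestion above using the Algolia Python API client (v3-style); the index name, synonym group and objectID are invented for the example:

# Sketch: one synonym group plus a query with optionalWords/removeStopWords.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YourApplicationID", "YourAdminAPIKey")
index = client.init_index("products")

# Group "car" together with related terms so they expand each other at query time.
index.save_synonym({
    "objectID": "car-related",
    "type": "synonym",
    "synonyms": ["car", "ferrari", "driving gloves"],
})

query = "red car"
results = index.search(query, {
    "optionalWords": query,        # every word of the query becomes optional
    "removeStopWords": True,       # drop "the", "a", ... to improve relevance
})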

Related

Approach for extracting relevant text using Azure Cognitive Search

Context:
I have a set of documents in SharePoint. I have set up Azure Cognitive Search (Standard tier) with data sources (SharePoint), index and indexers. I have also added a semantic configuration.
Outcome:
Ask a question, and have the search find and return relevant sections from the documents. I will use these sections to feed into OpenAI to construct a cohesive result.
I would like to replicate this Microsoft demo: https://www.youtube.com/watch?v=3t3qZu1Dy1k&t=572s It seems to me that, to create this 'demo', each document's content is very small and the documents could easily be combined to pass into OpenAI.
My experience so far:
The results return the documents and rank them, which seems OK - however it returns a short 'caption' and the full text. The caption is not necessarily related to my question - and can therefore not be used for the next step. The full document is far too big to be used in OpenAI.
I have managed to get Semantic answers - however the question has to be very precise to get a result, and the associated text is limited.
What I would like:
I would like the search to return sub-sections of the document, where the results of my question may be. If that is not supported, I feel I need an entirely new approach.
Any ideas? Thanks in advance for your time.
The demo you refer to works by feeding documents to Azure Cognitive Search. A query is then formulated as a question that uses the Semantic Search functionality to return a set of potential semantic answers extracted from the content in the index.
These potential semantic answers are then fed as a prompt to OpenAI's text completion service: https://beta.openai.com/docs/guides/completion
First, you must ensure you can get good semantic answers. Inspect the content you have indexed and verify that it contains content that could semantically be an answer to the questions you test with. Good content should have declarations of facts. I.e., statements that could be used verbatim as an answer to a question. Examples:
The capital of France is Paris.
Forecast for 2022 is expected to be 22%.
The semantic functionality in Azure Search will only respond with a text section containing a potential answer to your question. If you can't get this step to work, you have to work on improving that. Either via semantic configuration, choice of content, or by making sure you process your content so that the items in your index contain the relevant content in the correct properties.
Ensure your content is indexed and mapped to properties in a sensible way
Work with the semantic configuration until you get sensible results
Once the previous two steps are OK, submit the answers to OpenAI
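For orientation only, the overall flow could look roughly like this using plain REST calls. The endpoint, index name, semantic configuration name, API version and model are placeholders, and the exact parameter names depend on your Azure Search API version:

# Sketch: ask Azure Cognitive Search for extractive semantic answers,
# then feed them to OpenAI's completion endpoint as context.
# All names/keys below are placeholders, not values from the question.
import requests

SEARCH_ENDPOINT = "https://<your-service>.search.windows.net"
SEARCH_INDEX = "<your-index>"
SEARCH_KEY = "<query-key>"
OPENAI_KEY = "<openai-api-key>"

question = "What is the forecast for 2022?"

# 1) Semantic query; "answers": "extractive" requests semantic answers.
search_resp = requests.post(
    f"{SEARCH_ENDPOINT}/indexes/{SEARCH_INDEX}/docs/search?api-version=2021-04-30-Preview",
    headers={"api-key": SEARCH_KEY, "Content-Type": "application/json"},
    json={
        "search": question,
        "queryType": "semantic",
        "queryLanguage": "en-us",
        "semanticConfiguration": "<your-semantic-config>",
        "answers": "extractive|count-3",
        "captions": "extractive",
    },
).json()

answers = [a["text"] for a in search_resp.get("@search.answers", [])]

# 2) Use the extracted passages as context for OpenAI text completion.
prompt = ("Answer the question using only this context:\n"
          + "\n".join(answers)
          + f"\n\nQuestion: {question}\nAnswer:")
completion = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {OPENAI_KEY}"},
    json={"model": "text-davinci-003", "prompt": prompt, "max_tokens": 200},
).json()
print(completion["choices"][0]["text"])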
I have tested semantic search on two different data sets. Both were a combination of website content, PDF and Word documents, etc. The topic and volume of content were essentially the same. From one data set, I could get excellent semantic answers. But the other data set was disappointing.
My conclusion was that the content in the good data set was formulated and structured in a way that fits a semantic scenario. The other data set would often have logic and meaning presented in tables and layouts. As a human reading the content on paper, you would understand it. But, semantically, it would not make as much sense.

Elasticsearch - is there a method to match using "almost ident"

I use Facebook and Google Maps to get full geo-entity data values (country, city, street, zip...).
I store these values in my MongoDB.
I noticed that some locations differ in the way they are written on Facebook and on Google; for an (unreal) example, Facebook wrote the name 'Hawaii' with an 'e': 'Haweii'.
I use match_all fields (country + city + street...) to search for entities at the same location, but since some are written a bit differently, I will not find them.
Is there a way to make Elasticsearch search for 'Hawaii' and any other option that sounds like Hawaii but is written a bit differently?
Thanks for any help!
Using the Google API one can get a location's full details.
To match words that sound similar you can use the phonetic analyzer. You can also give fuzzy query a try to match words with spelling mistakes. Neither is foolproof though, and they may result in false positives. Guess you'll have to experiment a little to come up with a solution that best fits your need.
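For illustration, a fuzzy match query could look like this with a recent (8.x-style) elasticsearch Python client; the index and field names are made up for the sketch:

# Sketch: fuzzy match so "Haweii" still finds documents containing "Hawaii".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="locations",
    query={
        "match": {
            "city": {
                "query": "Haweii",
                "fuzziness": "AUTO",   # tolerate small spelling differences
            }
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])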
If you have a known set of differences between Facebook and Google maps, you could look at using Synonyms at either index time or query time to accommodate differences in the APIs; There are merits to taking either approach.
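And a hedged sketch of the index-time synonym approach: an index whose analyzer maps known spelling differences between the two APIs onto a canonical form. The filter name and the mapping below are invented for the Hawaii example, again assuming an 8.x-style Python client:

# Sketch: index-time synonyms to normalize known Facebook/Google spelling variants.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="locations",
    settings={
        "analysis": {
            "filter": {
                "geo_synonyms": {
                    "type": "synonym",
                    "synonyms": ["haweii => hawaii"],   # known variant -> canonical form
                }
            },
            "analyzer": {
                "geo_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "geo_synonyms"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "city": {"type": "text", "analyzer": "geo_analyzer"}
        }
    },
)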

Complex URL handling conception

I'm currently struggling with a complex URL-handling concept question. The application has a product property database table/collection with all the different product types (i.e. categories, colors, manufacturers, materials, etc.).
{_id:1,alias:"mercedes-benz",type:"brand"},
{_id:2,alias:"suv-cars",type:"category"},
{_id:3,alias:"cars",type:"category"},
{_id:4,alias:"toyota",type:"manufacturer"},
{_id:5,alias:"red",type:"color"},
{_id:6,alias:"yellow",type:"color"},
{_id:7,alias:"bmw",type:"manufacturer"},
{_id:8,alias:"leather",type:"material"}
...
Now the mission is to handle URL requests in the style below in every(!) possible order to retrieve the included product properties. The only allowed character is the dash (a settled SEO requirement; some properties can also include dashes themselves - I think that's also an important point - e.g. the category "suv-cars" or the manufacturer "mercedes-benz"):
http://www.example.com/{category}-{color}-{manufacturer}-{material}
http://www.example.com/{color}-{manufacturer}
http://www.example.com/{color}-{category}-{material}-{manufacturer}
http://www.example.com/{category}-{color}-nonexistingproperty-{manufacturer}
http://www.example.com/{color}-{category}-{manufacturer}
http://www.example.com/{manufacturer}
http://www.example.com/{manufacturer}-{category}-{color}-{material}
http://www.example.com/{category}
http://www.example.com/{manufacturer}-nonexistingproperty-{category}-{color}-{material}
http://www.example.com/{color}-crap-{manufacturer}
...
...so: every order of the properties should be allowed! The result has to be the information about the used properties per URL request (BTW yes, the duplicate content will be fixed by redirects and a predefined schema). The "nonexistingproperties"/"crap" are possible and should just be ignored.
UPDATE:
Idea 1: One way I'm thinking about the question is to split the query string by dashes and analyze the values one by one. The problem: for properties whose aliases consist of two, three or more words there are far too many different combinations and variations, so a lot of queries, which kills this idea I think.
Idea 2: The other way is to build a (in my opinion) far too large alias/URL table with all of the different combinations, but I think that's just an ugly workaround. There are about 15,000 different properties, so the number of aliases in the different sort orders kills this idea too.
Idea 3: It's your turn! Thanks for your thoughts and your time.
While your question is a bit broad, below are some ideas. There isn't a single awesome answer unless you find a free or commercial engine for this that works exactly the way you want.
The way I thought about your problem was to consider the URL as a list of keywords.
use Lucene as a keyword/tag system. It's good at the types of searches you suggest you want, including phrases, stems, etc.
store and index the data in the DB of your choice, but pull the keywords into memory and build a bit index of all keywords vs items. Iterate through the keyword table producing weighted results. If the order of keywords matters, you'll also need to make a pass through the result set to weight based on word order. These types of searches always need to cap their result set quickly in order to return results quickly.
cache the results like crazy from working matches, and give precedence to results that users seem to click on the most for a given URL.
attack the database by using tag indexes in MongoDB. You'd still need to merge and weight results. Very intensive and not likely a good use of DB resources.
read some of the academic papers on keyword searches. It's a popular topic.
build a table of words that have dashes in them, and normalize/convert those before running your queries
always check for full exact matches first
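Tying the last two suggestions together (a table of dash-containing words, exact matches first), here is a rough plain-Python sketch of parsing a dash-separated path against a known alias table, preferring the longest matching alias so multi-word values like "mercedes-benz" or "suv-cars" survive, and skipping unknown segments. The alias table simply mirrors the sample documents from the question:

# Sketch: greedy longest-match parsing of a dash-separated URL path.
ALIASES = {
    "mercedes-benz": "brand",
    "suv-cars": "category",
    "cars": "category",
    "toyota": "manufacturer",
    "red": "color",
    "yellow": "color",
    "bmw": "manufacturer",
    "leather": "material",
}

def parse_path(path: str) -> dict:
    tokens = path.strip("/").split("-")
    found, i = {}, 0
    while i < len(tokens):
        # Try the longest possible alias starting at token i first.
        for j in range(len(tokens), i, -1):
            candidate = "-".join(tokens[i:j])
            if candidate in ALIASES:
                found[ALIASES[candidate]] = candidate
                i = j
                break
        else:
            i += 1   # unknown segment ("crap", "nonexistingproperty"): ignore it
    return found

print(parse_path("/red-crap-mercedes-benz"))
# -> {'color': 'red', 'brand': 'mercedes-benz'}
print(parse_path("/suv-cars-yellow-toyota-leather"))
# -> {'category': 'suv-cars', 'color': 'yellow', 'manufacturer': 'toyota', 'material': 'leather'}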
The only way this may work is if you restrict all property values to be unique. So, you make a set of categories + colors + manufacturers, etc. All values have to be unique. This will allow you to find out to which property a value belongs.
The data structure for this should be fairly simple:
{_id:ValueOfTheProperty, Property:TypeOfProperty}
Here are some possible samples:
{ _id: Red, Property: Color }
{ _id: Green, Property: Color }
{ _id: Boots, Property: Category }
{ _id: Shoes, Property: Category }
...
This way, the order does not matter, and you are able to convert them in a single pass to a map:
{ Color: Red, Category: Boots }
Though, I predict some problems with ambiguous names here.
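A minimal sketch of that single-pass conversion; the lookup dict stands in for the collection above:

# Sketch: single-pass mapping of URL values to their property types.
# In practice VALUE_TO_PROPERTY would be loaded from the collection above.
VALUE_TO_PROPERTY = {
    "red": "Color",
    "green": "Color",
    "boots": "Category",
    "shoes": "Category",
}

def to_property_map(values):
    result = {}
    for value in values:
        prop = VALUE_TO_PROPERTY.get(value.lower())
        if prop:
            result[prop] = value       # unknown values are simply skipped
    return result

print(to_property_map(["Red", "Boots", "crap"]))
# -> {'Color': 'Red', 'Category': 'Boots'}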

How to auto-tag content, algorithms and suggestions needed

I am working with some really large databases of newspaper articles; I have them in a MySQL database and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles are accessible from a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
My initial approach was doing this:
Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter common non-descriptive words out like "them", "I", "this", "these", "their" etc.
When all the common words were filtered out, the only thing left was words that are tag-worthy.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by a space: for example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the first and last name together.
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest looking at implementing a proper Tokeniser (much better than splitting by whitespace), and then take a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
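A hedged sketch of those suggestions with NLTK (tokenisation, stop-word removal, and bigram counts so names like "John Doe" stay together); the corpus download calls and the toy article are this example's own assumptions:

# Sketch: proper tokenisation plus unigram/bigram frequency counts with NLTK.
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("punkt")       # tokeniser models
nltk.download("stopwords")   # stop-word lists

article = "John Doe met John Hanson. John Doe spoke about the new stadium."

tokens = [t.lower() for t in nltk.word_tokenize(article) if t.isalpha()]
stop = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t not in stop]

unigrams = Counter(content_tokens)
bigrams = Counter(nltk.bigrams(content_tokens))   # keeps "john doe" vs "john hanson" apart

print(unigrams.most_common(5))
print(bigrams.most_common(5))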
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then try how the algorithm tags the remaining set of articles to see how well it works.
You should use a metric such as tf-idf to get the tags out:
Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET, there's Lucene, for Python there's scikits.learn.
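A minimal sketch of that recipe with scikit-learn's TfidfVectorizer (the toy documents and k=3 are this example's assumptions; the library's exact IDF formula differs slightly from the one above, but the idea is the same):

# Sketch: pick the top-k tf-idf terms of each document as its tags.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the government announced a new budget and tax reform",
    "the team won the championship after a dramatic final game",
    "the new phone features a faster chip and better camera",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)            # shape: (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

k = 3
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:k]                 # indices of the k highest scores
    print(doc[:40], "->", [terms[j] for j in top])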
If you want to do better than this, use language models. That requires some knowledge of probability theory.
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation
If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I might be able to recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership - e.g. consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.
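If you decide to try LDA, a minimal sketch with gensim could look like this (the toy corpus, number of topics, and preprocessing are placeholders; it also illustrates the fractional class membership mentioned above):

# Sketch: fit a small LDA topic model with gensim and inspect the topics.
from gensim import corpora, models

texts = [
    ["mavericks", "basketball", "playoffs", "dallas"],
    ["election", "candidate", "campaign", "votes"],
    ["economy", "budget", "election", "growth"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]    # bag-of-words per document

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics():
    print(topic_id, words)

# Fractional membership: each document gets a distribution over topics.
print(lda.get_document_topics(corpus[0]))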
Your approach seems sensible and there are two ways you can improve the tagging.
Use a known list of keywords/phrases for your tagging and if the count of the instances of this word/phrase is greater than a threshold (probably based on the length of the article) then include the tag.
Use a part of speech tagging algorithm to help reduce the article into a sensible set of phrases and use a sensible method to extract tags out of this. Once you have the articles reduced using such an algorithm, you would be able to identify some good candidate words/phrases to use in your keyword/phrase list for method 1.
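A small sketch of the second method using NLTK's part-of-speech tagger to keep only nouns and adjective-noun phrases as tag candidates; the chunk grammar and the toy sentence are assumptions of this example:

# Sketch: use POS tags to extract noun-phrase tag candidates.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

article = "The city council approved the new football stadium despite budget concerns."

tokens = nltk.word_tokenize(article)
tagged = nltk.pos_tag(tokens)                      # [('The', 'DT'), ('city', 'NN'), ...]

# Simple chunk grammar: optional adjectives followed by one or more nouns.
grammar = "NP: {<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

candidates = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(candidates)   # e.g. ['city council', 'new football stadium', 'budget concerns']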
If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include a demo site and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott
Assuming you have pre-defined set of tags, you can use the Elasticsearch Percolator API like this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string
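For reference, the percolator idea looks roughly like this with a recent (8.x-style) elasticsearch Python client: store one query per tag, then percolate each article to see which tag queries match. The index name, tag queries and field names are invented for the sketch:

# Sketch: Elasticsearch percolator - register one query per tag, then
# find out which tag queries match a given article.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="tag-queries",
    mappings={
        "properties": {
            "query": {"type": "percolator"},
            "content": {"type": "text"},
        }
    },
)

# One stored query per pre-defined tag.
es.index(index="tag-queries", id="sports",
         document={"query": {"match": {"content": "stadium football match"}}})
es.index(index="tag-queries", id="politics",
         document={"query": {"match": {"content": "election campaign government"}}})
es.indices.refresh(index="tag-queries")

article = {"content": "The council debated the election campaign budget."}
resp = es.search(
    index="tag-queries",
    query={"percolate": {"field": "query", "document": article}},
)
print([hit["_id"] for hit in resp["hits"]["hits"]])   # e.g. ['politics']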
Are you talking about named-entity recognition? If so, Anupam Jain is right: it's a research problem typically tackled with deep learning & CRFs. As of 2017, work on the named-entity recognition problem is focused on semi-supervised learning techniques.
The link below is a related NER paper:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, the link below covers key-phrase extraction on Twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf

Lucene.NET - Search phrase containing "and"

Looking for advice on handling ampersands and the word "and" in Lucene queries. My test queries are (including quotes):
"oil and gas field" (complete phrase)
"research and development" (complete phrase)
"r&d" (complete phrase)
Ideally, I'd like to use the QueryParser as the input is coming from the user.
During testing and doc reading, I found that using the StandardAnalyzer doesn't work for what I want. For the first two queries, a QueryParser.Parse converts them to:
contents:"oil gas field"
contents:"research development"
Which isn't what I want. If I use a PhraseQuery instead, I get no results (presumably because "and" isn't indexed).
If I use a SimpleAnalyzer, then I can find the phrases but QueryParser.Parse converts the last term to:
contents:"r d"
Which again, isn't quite what I'm looking for.
Any advice?
If you want to search for "and", you have to index it. Write your own Analyzer or remove "and" from the list of stop words. The same applies to "r&d". Write your own Analyzer that creates 3 words from the text: "r", "d", "r&d".
Step one of working with Lucene is to accept that pretty much all of the work is done at the time of indexing. If you want to search for something then you index it. If you want to ignore something then you don't index it. It is this that allows Lucene to provide such high speed searching.
The upshot of this is that for an index to work effectively you have to anticipate what your analyzer needs to do up front. In this case I would write my own analyzer that doesn't strip any stop words and also transforms & to 'and' (and optionally # to 'at' etc). In the case of r&d matching research & development you are almost certainly going to have to implement some domain specific logic.
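This is not Lucene.NET code, but as a rough, language-neutral sketch (written here in Python) of the token stream such a custom analyzer would need to produce: keep stop words like "and", expand a standalone "&" to "and", and emit "r", "d" and the original "r&d" for ampersand compounds. A real Lucene.NET analyzer would implement the same behaviour as a Tokenizer/TokenFilter chain applied at both index and query time:

# Illustration only: the normalization a custom analyzer could apply.
import re

def analyze(text: str) -> list[str]:
    tokens = []
    for raw in re.split(r"\s+", text.lower().strip()):
        if not raw:
            continue
        if raw == "&":
            tokens.append("and")        # standalone ampersand becomes "and"
        elif "&" in raw:
            # "r&d" -> "r", "d", plus the original compound token.
            tokens.extend(p for p in raw.split("&") if p)
            tokens.append(raw)
        else:
            tokens.append(raw)          # stop words like "and" are kept
    return tokens

print(analyze("oil and gas field"))      # ['oil', 'and', 'gas', 'field']
print(analyze("r&d"))                    # ['r', 'd', 'r&d']
print(analyze("research & development")) # ['research', 'and', 'development']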
There are other ways of dealing with this. If you can differentiate between phrase searches and normal keyword searches then there is no reason you can't maintain two or more indexes to handle different types of search. This gives very quick searching but will require some more maintenance.
Another option is to use the high speed of Lucene to filter your initial results down to something more manageable using an analyzer that doesn't give false negatives. You can then run some detailed filtering over the full text of those documents that it does find to match the correct phrases.
Ultimately I think you are going to find that Lucene sacrifices accuracy in more advanced searches in order to provide speed, it is generally good enough for most people. You are probably in uncharted waters trying to tweak your analyzer this much.