How can I make Sphinx ignore some characters? - sphinx

I'm making a PHP website with MySQL backend and Sphinx as a search engine. Say, I have an item with the designer "Ray-Ban" and I need to get it as a result when the user types "ray ban" or "rayban". Should there be an exclusion list somewhere?

The standart way to do so is a charset_table option. charset_table defines characters that only have to be tokenized,
ie with this charset_table
index YOUR_INDEX_NAME
{
charset_table = 0..9, A..Z->a..z, _, a..z
such text
My best fiend is Hoo-foo but not Pe_ter.!!! That's all.
is parsed as these tokens
my best friend is hoo foo but not pe_ter that s all

Your best bet is probably the exceptions file - although that means you'll need to know every case where you want two different words/phrases to be treated the same.

As of version 0.9.8 there is an exclusion list option available per index named ignore_chars.
eg.
index YOUR_INDEX {
charset_type = utf-8
ignore_chars = -
More information available on the Sphinx website: http://sphinxsearch.com/docs/manual-0.9.8.html#conf-ignore-chars
Side note: they show using U+AD to remove soft-hyphens in their example. For some reason this didn't work for me, but the example I gave above worked fine.

Related

Simple `to_tsvector` configuration - postgres

How can I change the to_tsvector configuration to use a simple tokenization rule like:
lowercase
split by spaces only
Executing the following query:
SELECT to_tsvector('english', 'birthday=19770531 Name=John-Oliver Age=44 Code=AAA-345')
I get these lexemes:
'-345':9 '19770531':2 '44':6 'aaa':8 'age':5 'birthday':1 'code':7 'john':4 'name':3
The kind of searching I'm looking for is like:
(!birthday | birthday=19770531) & (code=AAA-345)
It means, get me all records that has a text "birthday=19770531" or doesn't have "birthday" at all, and a text equals to "code=AAA-345"). The way lexemes are being created it is not possible. I was expecting to have something like this:
'birthday=19770531':1 'age=44':2 'code=aaa-345':4 'name=john-oliver':3
You would have to code a custom parser. This can only be done in C.
But you might be able to use the existing testing parser test_parser, it seems to do what you want. If not, it would at least be a good starting point.
The problem may be that this is in src/test/modules/, and I don't think it ships with most installation packaging. So it might take some effort to get it to install. It would depend on your OS, version, and package manager.

TYPO3 queryBuilder: How to work with BINARY in where() clause?

I just have a short question.
There's no description in the following API overview of TYPO3 how to use a "BINARY" in where() clause: https://docs.typo3.org/m/typo3/reference-coreapi/master/en-us/ApiOverview/Database/QueryBuilder/Index.html#expr
What I want to achieve? this one:
WEHRE BINARY `buyer_code` = "f#F67d";
Actually I can only do the following:
->where(
$queryBuilder->expr()->eq('buyer_code', 'f#F67d')
);
But in this case I don't get a satisfying result for myself because I need case-sensitive here :-)
An another buyer_code exists "f#F67D" (the last char is uppercase) but I do need to look for the other one.
Thanks for helping.
Since TYPO3 is using Doctrine API here, you could try to do
->where('BINARY `buyer_code` = ' . $queryBuilder->createNamedParameter('f#F67d'))
Please keep in mind, that this query now only works for database backends, supporting the BINARY keyword!
Please have a look at Doctrine2 case-sensitive query The thread is a bit older, but seems to cover background and solution for your problem.

How to allow leading wild cards in custom smart search web part (Kentico 10)

I have a custom index for my products and I am using the Subset Analyzer. This Analyzer works great, but if you do field searches, it does not work.
For example, I have a document with the following fields:
"documentname", "My-Document-Name"
"tags", "1234,5678,9101"
"documentdescription", "This is a great Document, My-Document-Name."
When I just search "name AND tags:(1234)", I get this document in my results because it searches +_content:name.
-- However:
When I search "documentname:(name)^3.0 AND tags:(1234)", I do not get this document in my results.
Of course, when I do "documentname:(*name*)^3.0" I get a parse error saying: '*' or '?' not allowed as first character in WildcardQuery.
How can I either enable wildcard query in my custom CMS.Search webpart?
First of all you have to make sure that a field you checking is in the index with proper name. documentname might not be in the index it can be called _title, depends how you index is set up. Get lukeall and check your index (it should be in \CMS\App_Data\CMSModules\SmartSearch\YourIndexName). You can use luke to test your searches as well.
For examples there is no tags but there is documenttags field.
P.S. Wildcards are working and you are right you can't use them as a first character by default (lucene documentation says: You cannot use a * or ? symbol as the first character of a search), but there is a way to set it up in lucene.net, although i dont know if there are setting for that in Kentico. But i dont think you need wildcards, so your query should be (assuming you have documentname and documenttags in the index):
+(documentname:"My-Name" AND documenttags:"tag1")

whoosh doesn't search for short words like "C#"

i am using whoosh to index over 200,000 books. but i have encountered some problems with it.
the whoosh query parser returns NullQuery for words like "C#", "C++" with meta-characters in them and also for some other short words. this words are used in the title and body of some documents so i am not using keyword type for them. i guess the problem is in the analysis or query-parsing phase of searching or indexing but i can't touch my data blindly. can anyone help me to correct this issue. Tnx.
i fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements,here is the regex pattern:
'\w+[#+.\w]*'
this will make tokenizing of fields to be done successfully, and also the searching goes well.
but when i use queries like "some query++*" or "some##*" the parsed query will be a single Every query, just the '*'. also i found that this is not related to my analyzer and this is the Whoosh's default behavior. so here is my new question: is this behavior correct or it is a bug??
note: removing the WildcardPlugin from the query-parser solves this problem but i also need the WildcardPlugin.
now i am using the following code:
from whoosh.util import rcompile
#for matching words like: '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
#i don't need words shorter that two characters so i don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
this will solve my first problem, yes. but the main problem is in searching. i don't want to let users to search using the Every query or *. but when i parse queries like C++* i end up an Every(*) query. i know that there is some problem but i can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
# ...
)

How to use keywords include ampersand(&) in Facebook Search API

I want to use some keywords that include special characters like & in Facebook search api. I tried the query below but I cannot get useful results. Is there any chance for this usage in search api? How should I build my search query?
My example queries and keywords are "H&M", "marks & spencer",
http://graph.facebook.com/search?type=post&limit=25&q="H&M"
http://graph.facebook.com/search?type=post&limit=25&q="marks & spencer"
My team worked on this forever, ended up finding this as a solution that provides relevant results for a query with an ampersand, such as 'H&M'.
%26amp%3b
This is the hex equivilent to &
So your example link would be
http://graph.facebook.com/search?type=post&limit=25&q="H%26amp%3bM"
We found the solution thanks to Creative Jar
You want %26 which is the URL encode for ampersand so
http://graph.facebook.com/search?type=post&limit=25&q="H%26M" http://graph.facebook.com/search?type=post&limit=25&q="marks %26 spencer"
Depending on your language, it may have a URL encoding function or you can just use string replacement.
It seems, that all of solutions suggested here are not working any more.
Searching for q=H%26%bM returns empty data set. The same for q=H%26M.
It must have changed recently, in last 2 months.
If you try to search for postings about H&M on Facebook site (type H&M in search, then "Show me more results" on the bottom of list and then public posts on menu on the left side) the list is empty.
The only query that returns any results is q=H&M but it is not helpful, as the results are irrelevant for that query.