Sphinx exact match error

We have a website that uses this query:
SELECT did, kid FROM top_keywords WHERE MATCH('@keyword "^EXAMPLE KEYWORD$"') LIMIT 0, 100;
It works great 99% of the time, but with some encodings it doesn't work. Example:
SELECT did, kid FROM top_keywords WHERE MATCH('@keyword "^εργον$"') LIMIT 0, 100;
Produces error:
ERROR 1064 (42000): index top_keywords: syntax error, unexpected '$', expecting TOK_KEYWORD or TOK_INT near 'εργον$"'
My sphinx version is 2.0.6.
My only idea is that it has something to do with the charset_type setting in the config.

I tried copy/pasting your word εργον into
http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder
It appears to be composed entirely of non-ASCII UTF-8 chars (i.e. the code points are all 255+).
So ALL those letters would need to be in charset_table for it to work.
I'm guessing they are not in your charset_table (just setting charset_type=utf8 is NOT enough), in which case they are completely stripped, so the query becomes
SELECT did, kid FROM top_keywords WHERE MATCH('@keyword "^ $"') LIMIT 0, 100;
... as the letters are all taken as separators, which clearly leaves you with an invalid query.
Unfortunately I can't give you any good references for charset_table for international support (I don't know of any!), but perhaps start on the wiki: http://sphinxsearch.com/wiki/doku.php?do=search&id=charset_table
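To make that concrete, here is a minimal sketch of a Greek-aware charset_table for sphinx.conf. The ranges are an assumption based on the Unicode Greek block (uppercase Α..Ρ and Σ..Ϋ folded to lowercase, plus the lowercase and accented lowercase letters; accented uppercase forms omitted for brevity) — verify them against your data and reindex after changing them:

charset_type = utf-8
# digits, Latin with case folding, underscore, then Greek:
# fold uppercase to lowercase, keep lowercase/accented forms as-is
charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+391..U+3A1->U+3B1..U+3C1, U+3A3..U+3AB->U+3C3..U+3CB, \
    U+3AC..U+3CE

With those letters in charset_table, εργον survives tokenization and the exact-form query no longer collapses to "^ $".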

Related

Sphinx search: multi-term wordforms not indexed correctly

I'm having an issue with specific entries in my wordforms file that are not being
interpreted as expected.
Here are a couple of examples:
1/48 > forty-eighth
1/96 > ninety-sixth
As you can see, these entries contain both slashes and hyphens, which may be related to
my issue.
For some reason, Sphinx doesn't correctly equate each fraction to the spelled out
version. Search results for "1/48" are not the same as for "forty-eighth", as they should
be. In other words, the mapping between these equivalent forms is not working.
In my Sphinx config, I have the forward slash (/) set as a blend character, so I assume
that the fraction is being recognized properly.
In support of that belief, the following wordforms entry does work correctly:
1/4 > fourth
Does anyone have any idea why my multi-term synonyms would not be working as expected?
I have tried replacing the hyphen with a space, but this doesn't change the result at
all. Would it help to change the order of the terms (i.e., on which side of the ">" they
should be placed)?
Thank you very much for any help.
When using special characters in Sphinx it is always good to keep the following in mind:
By default, the Sphinx tokenizer handles unknown characters as whitespace
https://sphinxsearch.com/blog/2014/11/26/sphinx-text-processing-pipeline/
That has given me weird results too when using wordforms.
I would suggest you add the hyphen to charset_table so that ninety-sixth becomes one token. ignore_chars is also an option, but then you will be searching for ninetysixth instead.
Much depends on the rest of your dataset and use cases, of course.
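A minimal sketch of the relevant sphinx.conf settings (the wordforms path and the rest of charset_table are placeholders; U+2D is the hyphen written in code-point form):

# '/' stays blended so "1/48" survives as one token; '-' becomes a word character
charset_table = 0..9, a..z, A..Z->a..z, _, U+2D
blend_chars   = /
wordforms     = /path/to/wordforms.txt

Note that charset_table changes only take effect after a full reindex, so rebuild the index before retesting the wordforms.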

PQgetvalue() strips spaces from result of string_agg()

I have a GNU C++ project that uses the PostgreSQL API, and for some reason it strips spaces from the result of a certain query. Other environments (psql and pgAdmin) don't. The query is:
SELECT string_agg(my_varchar, ', ') FROM my_table;
Notice the space after the comma in the delimiter. Instead of 1046976, 1046977 being returned by PQgetvalue(), I get 1046976,1046977. Just for kicks, I tried changing the delimiter to silly things like string_agg(my_varchar, ',:) ' and string_agg(my_varchar, ', :)'. It doesn't strip the space if the space is in the middle of the delimiter.
Again, I don't have this problem if I do the same queries in db browsers like psql and pgAdmin; they don't strip the space in any of those queries.
Yes, I considered the possibility that the engine might be confused because the column being aggregated is a varchar while the data are 7-bit integers. I changed the query to something that is truly a varchar, but the spaces were still stripped.
Looking at https://www.postgresql.org/docs/9.4/static/functions-aggregate.html, I see that string_agg() expects its arguments to be texts or byteas. Well, I never got an error, but to be sure, I tried string_agg(my_varchar::text, ', '::text). It didn't make a difference.
I don't know a great deal about this API, but it doesn't appear to connect to the db with any options, so I don't think there's much to say about the configuration.
I'm running this in GNU C++ v4.9.2 on Debian 8.10. The PostgreSQL engine and API are 9.4.
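It's hard to see where the space goes without looking at the raw bytes. Here is a minimal libpq sketch (the connection string, table, and column names are placeholders) that hex-dumps exactly what PQgetvalue() returns, which should show whether the 0x20 after the comma is really missing:

#include <libpq-fe.h>
#include <cstdio>

int main() {
    // placeholder connection string; adjust for your environment
    PGconn *conn = PQconnectdb("dbname=mydb");
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "%s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }
    PGresult *res = PQexec(conn,
        "SELECT string_agg(my_varchar, ', ') FROM my_table;");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0) {
        const char *val = PQgetvalue(res, 0, 0);
        // dump each byte so a stripped 0x20 after the comma is obvious
        for (const char *p = val; *p; ++p)
            std::printf("%02x ", (unsigned char) *p);
        std::printf("\n%s\n", val);
    }
    PQclear(res);
    PQfinish(conn);
    return 0;
}

If the 0x20 shows up in the dump, the space is being lost later in the application rather than by the API.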

Weird behaviour with postgresql tsvector and tsquery around emails

I've been playing around with PostgreSQL's text search capability and I've encountered what I consider weird behaviour. This is on PostgreSQL 8.3, so it may not be current behaviour:
select to_tsvector('some@email.com') @@ to_tsquery('some@email.com:*');
select to_tsvector('some@email.com') @@ to_tsquery('some@email.c:*');
The first query matches but the second fails...
Does anyone know what is going on here?
I've tried escaping the @ and . characters, but no luck.
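No answer is recorded here, but a reasonable way to investigate (an assumption, not a confirmed diagnosis) is to ask the parser how it tokenizes each string. The default parser recognizes 'some@email.com' as a single email token; 'some@email.c' may be split differently, in which case the prefix query would be searching for lexemes the tsvector never contained:

SELECT alias, token FROM ts_debug('english', 'some@email.com');
SELECT alias, token FROM ts_debug('english', 'some@email.c');
SELECT to_tsquery('english', 'some@email.c:*');

Comparing the token lists from ts_debug() against the output of to_tsquery() usually makes this kind of mismatch obvious.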

Sphinx dash in author names causing problems when searching

I've read all the posts about dashes and tried pretty much everything mentioned in them, yet cannot figure out a strange problem I'm having.
For example, I have an author name like this:
Arturo Pérez-Reverte
A search for 'pérez-reverte' will not turn up anything, nor will 'pérez\-reverte', so escaping the dash is not the issue.
But a search for 'spider-man' will return hits, proving that the dash seems to be working.
However, a search for 'perez reverte' also finds a hit because it searches each word separately and finds the 'reverte' in 'perez-reverte' (but doesn't seem to find the 'perez').
A search for either 'pérez' or 'perez' finds the same number of documents, suggesting that the accent is not an issue (I do have a charset_table which accounts for accented characters).
So I'm very confused as to what's happening here. If it isn't the accent and it isn't the dash, what could it be?
I don't have any ignore_chars set, I'm using UTF-8 and have a charset_table to treat accented characters as regular characters.
The only difference between these two terms is that one of them is a title (spider-man) and the other an author, but they are both part of the same Sphinx index declaration, so I don't see that as an issue in any way.
Any help would be greatly appreciated.
After much fighting with it, I found out that, even though my database is all UTF-8 with the proper collation, I needed to add this to sphinx.conf for everything to work properly:
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER SET utf8
After doing that, and having the proper charset_table, everything seems to be working fine.
Hope this helps someone else.
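Presumably the MySQL connection had been returning result bytes in a different character set until SET NAMES forced UTF-8, so the tokenizer never saw the accented characters that charset_table was supposed to map. A sketch of where those lines live (the source name and query are placeholders):

source authors_src
{
    # force the MySQL connection to hand Sphinx UTF-8 bytes before indexing
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET CHARACTER SET utf8
    sql_query     = SELECT id, author, title FROM books
}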

Interpretation of Greek characters by FOP

Can you please help me interpret the Greek character with HTML entity &#8062; and hex value U+1F7E?
Details of this character can be found at the URL below:
http://www.isthisthingon.org/unicode/index.php?page=01&subpage=F&hilite=01F7E
When I run this character through Apache FOP, it gives me an ArrayIndexOutOfBoundsException:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)
When I looked into the FOP code, I was unable to understand the need for the lineBreakProperties[][] array in LineBreakUtils.java.
I also noticed that FOP fails with a similar error for all the Greek characters mentioned on the above page which are non-displayable.
What are these special characters?
Why is there no display for these characters? Are they line breaks or tabs?
Has anyone solved a similar issue with FOP?
The U+1F7E code point is part of the Greek Extended Unicode block, but it does not represent any actual character; it is a reserved but unassigned code point. Here is the chart from Unicode 6.0: http://www.unicode.org/charts/PDF/U1F00.pdf.
So the errors you are getting are perhaps not so surprising.
I ran a FO file that included the following <fo:block> through both FOP 0.95 and FOP 1.0:
<fo:block>Unassigned code point: ὾</fo:block>
I did get the same java.lang.ArrayIndexOutOfBoundsException that you are seeing.
When using an adjacent "real" character, there was no error:
<fo:block>Assigned code point: ώ</fo:block>
So it seems you have to ensure that your datastream does not contain unassigned code points like U+1F7E.
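As a practical guard, here is a minimal Java sketch (the method name is made up) that drops unassigned code points before the text reaches FOP; Character.isDefined() returns false for reserved-but-unassigned code points such as U+1F7E:

// strip code points that Unicode has reserved but not assigned
static String stripUnassigned(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (Character.isDefined(cp)) {
            sb.appendCodePoint(cp);
        }
        i += Character.charCount(cp);
    }
    return sb.toString();
}

For example, stripUnassigned("Unassigned: \u1F7E") returns "Unassigned: ", so FOP never sees the offending code point.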
Answer from Apache
At first glance, this seems like a minor oversight in the implementation of Unicode linebreaking in FOP: it does not take into account the possibility that a given codepoint is not assigned a 'class' in the linebreaking context. (U+1F7E does not appear in the file http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as a basis to generate those arrays in LineBreakUtils.java.)
On the other hand, one could obviously raise the question why you so desperately need to have an unassigned codepoint in your output. Are you absolutely sure you need this? If yes, then can you elaborate on the exact reason? (i.e. What exactly is this unassigned codepoint used for?)
The most straightforward 'fix' seems to be roughly as follows:
Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java
--- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (revision 1054383)
+++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (working copy)
@@ -87,6 +87,7 @@
         /* Initial conversions */
         switch (currentClass) {
+            case 0: // Unassigned codepoint: consider as AL?
             case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
             case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
             case LineBreakUtils.LINE_BREAK_PROPERTY_XX:
What this does is assign the class 'AL' ('Alphabetic') to any codepoint that has not been assigned a class by Unicode, which means it will be treated as a regular letter.
Now, the reason I am asking whether you are sure you know what you're doing is that this may turn out to be undesirable. Perhaps the character in question needs to be treated as a space rather than a letter.
Unicode does not define U+1F7E other than as a 'reserved' character, so it makes sense that Unicode cannot say what should happen with this character in the context of linebreaking...
That said, it is also wrong of FOP to crash in this case, so the bug is definitely genuine.