Are there more robust SphinxQL Diagnostics other than Show Meta? - sphinx

I have a pretty complex sphinx index.
Recently, most of my searches that include one important word have been returning false positives (text records that do not contain the word at all).
To see what was going on, I ran Show Meta to check whether a synonym or some other issue with the term was causing the false results.
However Show Meta showed 1 keyword, the one I entered.
total 100000
total_found 1254244
time 6.856
keyword[0] book
docs[0] 1254244
hits[0] 3037375
Yet the word actually occurs in only a small fraction of the 1.25 million+ records found.
I'm wondering whether there is an extension of, or alternative to, SphinxQL's 'Show Meta' that gives more information, or where a good place would be to start looking for the cause of such an issue (I'd have thought Meta would indicate it, but it does not).
I checked my cfg and the word is nowhere to be found (not mapped or referenced).
I checked stopwords and exceptions: ditto.
The cfg settings are pretty basic:
exceptions = /etc/sphinxsearch/lemmatizer/exceptions.txt
stopwords = /etc/sphinxsearch/lemmatizer/stopwords.txt
stopword_step = 0
index_sp=1
min_word_len = 1
min_infix_len = 1
min_stemming_len = 1
#index_field_lengths = 1
html_strip = 1
enable_star = 1
So I'm not clear where to even start looking for the issue, and I was hoping there might be some other diagnostic tool more robust than "Show Meta".
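Two SphinxQL statements may be worth trying here, offered as a starting point rather than a confirmed fix: CALL KEYWORDS reports how a query term is tokenized and normalized against an index, and SHOW PLAN (after SET profiling=1, in Sphinx 2.1 and later) prints the query tree as it was executed. The helper below only builds the statement strings; the function name and the index name my_index are hypothetical.

```python
def sphinxql_diagnostics(index: str, term: str) -> list:
    """Build SphinxQL statements for inspecting how `term` is handled.

    Only strings are produced; run them through any MySQL-protocol
    client connected to searchd's SphinxQL listener.
    """
    def quote(value: str) -> str:
        # Escape backslashes and single quotes for a string literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    return [
        # How the term is tokenized, plus per-keyword docs/hits (third arg = 1).
        f"CALL KEYWORDS({quote(term)}, {quote(index)}, 1)",
        # Profiling must be on before the query for SHOW PLAN to have data.
        "SET profiling=1",
        f"SELECT id FROM {index} WHERE MATCH({quote(term)}) LIMIT 5",
        "SHOW PLAN",
        "SHOW META",
    ]


for statement in sphinxql_diagnostics("my_index", "book"):
    print(statement + ";")
```

If CALL KEYWORDS shows the term normalized to something unexpected, that would point at tokenization settings rather than at the data.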

Related

Splunk: How to get two searches in one timechart/graph?

I have two queries which look like this:
source="/log/ABCD/cABCDXYZ/xyz.log" doSomeTasks| timechart partial=f span=1h count as "#XYZ doSomeTasks" | fillnull
source="/log/ABCD/cABCDXYZ/xyz.log" doOtherTasks| timechart partial=f span=1h count as "#XYZ doOtherTasks" | fillnull
I now want to show these two searches in one graph (I do not want to sum the numbers from each search into one value).
I saw that appendcols might do this, but my attempts to use the command were not successful.
I tried this, but it did not work:
source="/log/ABCD/cABCDXYZ/xyz.log" doSomeTasks|timechart partial=f span=1h count as "#XYZ doSomeTasks" appendcols [doOtherTasks| timechart partial=f span=1h count as "#XYZ doOtherTasks" | fillnull]
Thanks to PM 77-1 the issue is solved.
This command works:
source="/log/ABCD/cABCDXYZ/xyz.log" doSomeTasks|timechart partial=f span=1h count as "#XYZ doSomeTasks" | appendcols[search source="/log/ABCD/cABCDXYZ/xyz.log" doOtherTasks| timechart partial=f span=1h count as "#XYZ doOtherTasks" | fillnull]
Note: You do not have to mention the source in the second search command if it is the same source as the first one.
General solution
Generate each data column by using a subsearch query in the following form:
|appendcols[search (myquery) |timechart count]
Additional steps
The list of one-or-more query columns needs to be preceded by a generated column which establishes the timechart rows (and gives appendcols something to append to).
|makeresults |timechart count |eval count=0
Note: It isn't strictly required to start with a generated column, but I've found this to be a clean and robust approach. Notably, it avoids problems that may occur in the special-case of "No results found", which otherwise can confuse the visualization rendering. Plus it's more uniform and, as a result, easier to work with.
Finally, specify each of the fields to be charted, with _time as the x-axis:
|fields _time, myvar1, myvar2, myvar3
Complete example
|makeresults |timechart span=5m count |eval count=0
|appendcols[search (myquery1) |timechart span=5m count as myvar1]
|appendcols[search (myquery2) |timechart span=5m count as myvar2]
|appendcols[search (myquery3) |timechart span=5m count as myvar3]
|fields _time, myvar1, myvar2, myvar3
Be careful to use the same span throughout.
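The reason the span matters is that appendcols pastes subsearch rows onto the main rows by position, not by _time. The following is a simplified Python model of that behaviour, not Splunk's implementation; the function name and sample rows are made up for illustration, and the sketch simply ignores any extra subsearch rows.

```python
def appendcols(main_rows, sub_rows):
    """Paste fields of the Nth sub row onto the Nth main row.

    Rows are dicts. Fields already present in the main row (such as
    _time) are kept, mirroring appendcols without override.
    """
    merged = []
    for i, row in enumerate(main_rows):
        combined = dict(row)
        if i < len(sub_rows):
            for key, value in sub_rows[i].items():
                combined.setdefault(key, value)
        merged.append(combined)
    return merged


base = [{"_time": "10:00", "count": 0}, {"_time": "10:05", "count": 0}]

# Same 5-minute buckets: values land on the right rows.
same_span = [{"_time": "10:00", "myvar1": 3}, {"_time": "10:05", "myvar1": 7}]
print(appendcols(base, same_span))

# 10-minute buckets: the 10:10 value is pasted onto the 10:05 row.
other_span = [{"_time": "10:00", "myvar2": 5}, {"_time": "10:10", "myvar2": 9}]
print(appendcols(base, other_span))
```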
Other hints
When comparing disparate data on the same chart, perhaps to evaluate their relative timing, it's common to have differences in type or scale that can render the overlaid result nearly useless. For cases like this, don't neglect the 'Log' format option for the Y-Axis.
In some cases, it may even be worthwhile to employ data hacks with eval to massage the values into a visually comparable state. For example, appending |eval myvar1=if(myvar1=0,0,1) after timechart count collapses every nonzero count to 1, so the series only shows whether anything occurred in each bucket. Here are some relevant docs:
Mathematical functions
Comparison and Conditional functions

Sphinxsearch min_infix_len = 1 is disabled by force on 2.2.x?

I had a previous version of SphinxSearch that worked like a charm: it was fast and the results were accurate. After upgrading to 2.2.10, changes in that release made the search results much worse.
Now, if I search for "Lenovo y", for example, against an existing "lenovo y5070", I get no results, although my config has:
min_word_len = 1
min_infix_len = 1
Searching for "Lenovo y5" works fine, so it seems to me that the infix length is being forced to 2 instead of 1. This is very bad for my search results. Any suggestions?
Try adding expand_keywords = 1 together with index_exact_words = 1.

Unexpected results when star enabled

I have an index that looks like this:
index user_core
{
source = user_core_0
path = ...
charset-type = utf-8
min_infix_length = 3
enable_star = 1
}
We escape and wrap all of our searches in asterisks. Every so often, we'll come across a
very strange case in which something such as the following happens:
Search: mocuddles
Results: All users with nicknames containing "yellowstone".
This behavior seems unpredictable, but it happens every time for the terms it does affect.
I've been told that there's no real way to debug Sphinx indexes. Is this true? Is there
any sort of "explain query" functionality?
I've confirmed at this point that these are instances of CRC32 hash collisions. Bummer.
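Assuming the index keys keywords by their CRC32, as the conclusion above implies, two different words with the same hash become indistinguishable at search time. A rough Python sketch for spotting such groups in a word list follows; the function name, the word list, and the bits parameter (used only to force a collision on a tiny sample) are all illustrative.

```python
import zlib
from collections import defaultdict


def collision_groups(words, bits=32):
    """Group words whose CRC32, truncated to `bits` bits, is identical."""
    mask = (1 << bits) - 1
    buckets = defaultdict(list)
    for word in words:
        buckets[zlib.crc32(word.encode("utf-8")) & mask].append(word)
    # Only buckets holding more than one word are collisions.
    return [group for group in buckets.values() if len(group) > 1]


# Hypothetical keywords. With a 4-bit hash there are only 16 buckets,
# so 17 distinct words are guaranteed to collide somewhere.
words = [f"user{i}" for i in range(17)]
print(collision_groups(words, bits=4))
```

Run with bits=32 over the real set of nicknames, this would list the words that a CRC32-keyed dictionary cannot tell apart.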

TermQuery not returning on a known search term, but WildcardQuery does

I'm hoping someone with enough insight into the inner workings of Lucene might be able to point me in the right direction =)
I'll skip most of the surrounding irrelevant code and cut right to the chase. I have a Lucene index, to which I am adding the following field (variables replaced by their literal values):
document.Add( new Field("Typenummer", "E5CEB501A244410EB1FFC4761F79E7B7",
Field.Store.YES , Field.Index.UN_TOKENIZED));
Later, when I search my index (using other types of queries), I am able to verify that this field does indeed appear in my index - like when looping through all Fields returned by Document.GetFields()
Field: Typenummer, Value: E5CEB501A244410EB1FFC4761F79E7B7
So far so good :-)
Now the real problem is - why can I not use a TermQuery to search against this value and actually get a result.
This code produces 0 hits:
// Returns 0 hits
bq.Add( new TermQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
But if I switch this to a WildcardQuery (with no wildcards), I get the 1 hit I expect.
// returns the 1 hit I expect
bq.Add( new WildcardQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
I've checked field lengths, I've checked that I am using the same Analyzer and so on and I am still on square 1 as to why this is.
Can anyone point me in a direction I should be looking?
I finally figured out what was going on. I'm expanding the tags for this question as it, much to my surprise, actually turned out to be an issue with the CMS this particular problem exists in. In summary, the problem came down to this:
The field is stored UN_TOKENIZED, meaning Lucene will store it exactly "as-is"
The BooleanQuery I pasted snippets from gets sent to the Sitecore SearchManager inside a PreparedQuery wrapper
The behaviour I expected from this was, that my query (having already been prepared) would go - unaltered - to the Lucene API
Turns out I was wrong. It passes through a RewriteQuery method that copies my entire set of nested queries as-is, with one exception - all the Term arguments are passed through a LowercaseStrategy()
As I indexed an UPPERCASE Term (UN_TOKENIZED), and Sitecore changes my PreparedQuery to lowercase - 0 results are returned
I'm not going to start an argument over whether this is a "by design" or a "by design flaw" implementation of the Lucene Wrapper API - I'll just note that rewriting my query when using the PreparedQuery overload is... to me... unexpected ;-)
A further lesson from this: storing the field as TOKENIZED will eliminate this problem too, as the StandardAnalyzer lowercases all tokens by default.
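The mismatch can be reproduced with a toy exact-match index in Python. This is only a model of the behaviour described above, not Lucene or Sitecore code: build_index, term_query, and the lowercase_rewrite flag are invented names, and the tokenized branch is a rough stand-in for an analyzer that lowercases.

```python
def build_index(docs, field, tokenized):
    """Map each indexed term to the ids of the documents containing it.

    tokenized=False keeps the whole value as a single unchanged term.
    tokenized=True splits on whitespace and lowercases each token.
    """
    index = {}
    for doc_id, doc in docs.items():
        value = doc[field]
        terms = [t.lower() for t in value.split()] if tokenized else [value]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def term_query(index, term, lowercase_rewrite=False):
    """Exact lookup; lowercase_rewrite mimics the wrapper's rewrite step."""
    if lowercase_rewrite:
        term = term.lower()
    return index.get(term, set())


docs = {1: {"Typenummer": "E5CEB501A244410EB1FFC4761F79E7B7"}}
value = docs[1]["Typenummer"]

untokenized = build_index(docs, "Typenummer", tokenized=False)
print(term_query(untokenized, value))                          # {1}
print(term_query(untokenized, value, lowercase_rewrite=True))  # set()

tokenized = build_index(docs, "Typenummer", tokenized=True)
print(term_query(tokenized, value, lowercase_rewrite=True))    # {1}
```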

Generate unique 3 letter/number code and compare to existing ones in PHP/MySQL

I'm making a code generation script for UN/LOCODE system and the database has unique 3 letter/number codes in every country. So for example the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia, "AR TLL" can also exist (the country code and the 3 letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations, the user has also the possibility of entering the 3 letter/number him/herself (which will be checked against the database before submission automatically).
Finally, neither 0 nor 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've come up with:
1. Check every code from AAA to 999 in order, but each code would require a new query (slow?).
2. Store all the 40000 possibilities in an array and subtract all the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually, maybe 40000 isn't such a big number).
3. Generate a random code and check whether it already exists; if it does, start over. That's just risk taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I will go with number 2, it is simple and 40000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 (A-Z, 2-9) letters.
I would go for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.
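The mapping described above can be sketched in Python (rather than MySQL) to check the arithmetic. The function names are illustrative; the alphabet order follows the A-Z = 0..25, 2-9 = 26..33 convention from this answer.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789"  # A-Z = 0..25, 2-9 = 26..33
BASE = len(ALPHABET)  # 34


def code_to_number(code: str) -> int:
    """Read a code as a base-34 number, most significant symbol first."""
    number = 0
    for char in code:
        number = number * BASE + ALPHABET.index(char)
    return number


def number_to_code(number: int, length: int = 3) -> str:
    """Inverse of code_to_number, left-padded with 'A' (digit 0)."""
    chars = []
    for _ in range(length):
        number, digit = divmod(number, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


print(code_to_number("XYZ"))  # 27429
print(number_to_code(27429))  # XYZ
```

With codes stored as numbers, "next available" becomes a search for the smallest unused integer below 34^3 = 39304.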
I went with the 2nd option. I was also able to make a script that will try to match as close as possible the country name, for example for Tartu it will try to match T** then TA* and if possible TAR, if not it will try TAT as T is the next letter after R in Tartu.
The code is quite extensive, I'll just post the part that takes the first possible code:
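$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();
// store all possibilities in a huge array, in AAA..999 order
for ($i = 0; $i < $length; $i++)
    for ($j = 0; $j < $length; $j++)
        for ($k = 0; $k < $length; $k++)
            $codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);
// collect the codes already used in this country
// (mysql_* was removed in PHP 7; on current versions use mysqli or PDO
// with a bound parameter instead of interpolating $country)
$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
    $used[] = $result['code'];
// array_diff() preserves keys, so index 0 is missing whenever AAA is taken;
// reset() returns the first remaining element (or false if none is left)
$remaining = array_diff($codes, $used);
$code = reset($remaining);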
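For comparison, the same "first unused code" idea can be written in Python without materialising all combinations. This is a sketch, not a port of the script above: the function name is illustrative and the used list stands in for the result of the database query.

```python
from itertools import product

ALLOWED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789"


def first_free_code(used, length=3):
    """Return the first code in ALLOWED order not in `used`, or None."""
    taken = set(used)
    # product() yields AAA, AAB, ..., 999 with the last symbol varying fastest.
    for chars in product(ALLOWED, repeat=length):
        code = "".join(chars)
        if code not in taken:
            return code
    return None


print(first_free_code(["AAA", "AAB", "AAD"]))  # AAC
```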
Thanks for your opinion, this will be the key to transport codes all over the world :)