Sphinx query before and after a term - sphinx

Is it possible to set up a query in sphinx with a term that has to either also match a word before OR after?
(TermBefore) (Term) (TermAfter)
so that both
TermBefore Term
Term TermAfter
would match but
Term
does not?

The proximity search operator is pretty much designed for this
"Term TermAfter"~2
http://sphinxsearch.com/docs/current.html#extended-syntax
Ah, I thought you meant 'TermAfter' to actually be the same word, just that it can come before or after.
But if they are two different terms, possibly the easiest is just to do:
"TermBefore Term" | "Term TermAfter"
That is just the simple phrase operator, where either phrase must match.
Edit again:
If you don't want to require the matches to be adjacent, use the strict order operator rather than the phrase operator:
(TermBefore << Term) | (Term << TermAfter)
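To make the intended behaviour of the phrase variant concrete, here is a rough Python model of it. This is only a sketch of the semantics on whitespace-split, lowercased text; it is not how Sphinx tokenizes or matches, and the function names are made up for illustration.

```python
def has_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur as consecutive tokens in doc_tokens."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def matches(text, before="termbefore", term="term", after="termafter"):
    """Model of: "TermBefore Term" | "Term TermAfter" (hypothetical helper)."""
    doc = text.lower().split()
    return has_phrase(doc, [before, term]) or has_phrase(doc, [term, after])


for sample in ["TermBefore Term", "Term TermAfter", "Term"]:
    print(sample, "->", matches(sample))
```

A document containing only Term fails both phrases, which is the requirement from the question.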

Related

Regsubbing simple matches

I'm looking for a regsub example that does the following:
123tcl456TCL789 => 123!tcl!456!TCL!789
This is an Tcl example => This is an !Tcl! example
Yes, I could use string first to find a position and mash things together, but I have seen a regsub command in the past that does what I want and I can't recall it. What would be the regsub command that allows that? I would guess regsub -all -nocase is a start.
I am bad at regsub and regexps. I wonder if there is a site or tool/script that we can supply a string, the final result and then we get the regsub form.
You're looking at the right tool, but there are various options, depending on exactly what the conditions are when faced with other text. Here's one that wraps each occurrence of "Tcl" (any capitalisation) with exclamation marks:
set inputString "123tcl456TCL789"
set replaced [regsub -all -nocase {tcl} $inputString {!&!}]
puts $replaced
That's using a very simple regular expression with the -nocase option, and the replacement means "put ! on either side of the substring matched".
Another (more generally applicable... perhaps) might be to put ! after any run of letters that is followed by a digit, or any run of digits that is followed by a letter.
set replaced [regsub -all {[A-Za-z]+(?=[0-9])|[0-9]+(?=[A-Za-z])} $inputString {&!}]
Note that doing things correctly typically requires understanding the real input data fairly well. For example, whether the numbers include floating point numbers in scientific notation, or whether the substrings to delimit are of fixed length.
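For comparison, the same two substitutions can be sketched in Python with the standard re module, where \g<0> plays the role of Tcl's & (the whole match). The function names are illustrative only.

```python
import re


def wrap_tcl(text):
    """Wrap every occurrence of 'tcl' (any capitalisation) in '!'."""
    return re.sub(r"tcl", r"!\g<0>!", text, flags=re.IGNORECASE)


def mark_boundaries(text):
    """Put '!' after a letter run followed by a digit, or a digit run followed by a letter."""
    return re.sub(r"[A-Za-z]+(?=[0-9])|[0-9]+(?=[A-Za-z])", r"\g<0>!", text)


print(wrap_tcl("123tcl456TCL789"))
print(wrap_tcl("This is an Tcl example"))
print(mark_boundaries("123tcl456TCL789"))
```

Both approaches give the same result on the first sample string, but only the first one handles the sentence example, since there "Tcl" is surrounded by spaces rather than digits.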

Sphinx(Search) - documents which matches keyword twice (thrice, etc.)

Is there a way to output only documents which contain n matches of a search term?
For example, I want to output all documents containing the search term "Pablo Picasso" | "Picasso Pablo" at least two (three, n) times.
What would such a query look like?
My current query is:
SELECT * FROM myIndex WHERE MATCH('"Pablo Picasso" | "Picasso Pablo"');
You could do it by filtering on weight (i.e. results that contain it multiple times will rank higher).
But a useful trick is the strict order operator:
MATCH('Pablo << Pablo')
would require the word twice (i.e. one occurrence before the other!).
You can also use the proximity operator to simplify your original query. It just wants the words near each other, which is more concise than two phrase operators:
MATCH('"Pablo Picasso"~1')
... i.e. within 1 word of each other, so adjacent.
Combine the two:
MATCH('"Pablo Picasso"~1 << "Pablo Picasso"~1')
and for three occurrences:
MATCH('"Pablo Picasso"~1 << "Pablo Picasso"~1 << "Pablo Picasso"~1')
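Since the pattern is just the same sub-expression chained with <<, the MATCH() text for any n can be generated. A minimal sketch, assuming you build the query string client-side; at_least_n is a hypothetical helper, not part of any Sphinx client library.

```python
def at_least_n(unit, n):
    """Chain the same sub-expression n times with the strict order operator."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return " << ".join([unit] * n)


unit = '"Pablo Picasso"~1'
for n in (1, 2, 3):
    print(at_least_n(unit, n))
```

The resulting string still has to be passed through whatever quoting or escaping your client uses before it goes into MATCH('...').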

Wildcard searching between words with CRC mode in Sphinx

I use sphinx with CRC mode and min_infix_length = 1 and I want to use wildcard searching between the characters of a keyword. Assume I have some data like this in my index files:
name
-------
mickel
mick
mickol
mickil
micknil
nickol
nickal
and when I search for all records whose name starts with 'mick' and ends with 'l':
select * from all where match ('mick*l')
I expect the results should be like this:
name
-------
mickel
mickol
mickil
micknil
but nothing returned. How can I do that?
I know that I can do this in dict=keywords mode, but I have to use crc mode for some reasons.
I also tried the '^' and '$' operators and they didn't work.
You can't use 'middle' wildcards with CRC. This is one of the reasons for dict=keywords: the wildcards it can support are much more flexible.
With CRC, the indexer 'precomputes' all the wildcard combinations and injects them as separate keywords in the index,
e.g. for mickel as a document word, and with min_prefix_len=1, the indexer will create the words:
mickel
mickel*
micke*
mick*
mic*
mi*
m*
... as words in the index, so all the combinations can match. If using min_infix_len, it also has to do all the combinations at the start as well (so (word_length)^2 + 1 combinations).
... if it had to precompute all the combinations for wildcards in the middle, it would be a lot more again (particularly if it then allows for middle AND start/end combinations as well).
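The prefix expansion listed above for mickel can be reproduced with a few lines of Python. This only mirrors the list in this answer as an illustration; it is not Sphinx's actual indexing code, and prefix_keywords is a made-up name.

```python
def prefix_keywords(word, min_prefix_len=1):
    """List the keyword plus its wildcard prefix forms, longest first."""
    keywords = [word]
    for length in range(len(word), min_prefix_len - 1, -1):
        keywords.append(word[:length] + "*")
    return keywords


for keyword in prefix_keywords("mickel"):
    print(keyword)
```

Raising min_prefix_len shortens the list, which is why larger values keep the index smaller.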
Although having said that, you can rewrite
select * from all where match ('mick*l')
as
select * from all where match ('mick* *l')
because with min_infix_len, the start and end will be indexed as separate words. You just need to insist that both match (although I can't think how to make them both match the same word!).
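A small Python model of the rewritten query shows both the expected hits and the caveat about the two wildcards matching different words. It is only a sketch on whitespace-split text with hypothetical names, not a reproduction of Sphinx's matching.

```python
def matches_both(text, prefix="mick", suffix="l"):
    """Model of 'mick* *l': some word starts with prefix AND some word ends with suffix."""
    words = text.lower().split()
    starts = any(word.startswith(prefix) for word in words)
    ends = any(word.endswith(suffix) for word in words)
    return starts and ends


names = ["mickel", "mick", "mickol", "mickil", "micknil", "nickol", "nickal"]
hits = [name for name in names if matches_both(name)]
print(hits)

# Caveat: in a multi-word field the two wildcards can be satisfied by different words.
print(matches_both("mickey hall"))
```

For single-word fields like the name column in the question this gives the wanted four rows; for multi-word fields a client-side check on the returned rows would be needed to rule out cases like the last one.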

Sphinx search entire field but not begin/end

I am trying to match a field that contains all the words in a phrase but so far have only been able to use ^ and $ to do it. For instance
^Word1 Word2$
Returns a record named "Word1 Word2" but not "Word3 Word1 Word2".
However what I want in fact is also "Word2 Word1"
So I get how to use the ^ and $ to mean start and end of the field but that forces the words I put in to be in particular order. Clearly I could also search for "Word2 Word1" but it gets more complex (3+ word terms, etc)
Is there a way to tell sphinx to look in an entire field in any order. In other words I want "Word1 Word2" to match "Word1 Word2" and "Word2 Word1" but not "Word3 Word2 Word1"
Well, you can use NEAR/ or the proximity operator to require the words to be adjoining (but in any order), but there isn't really a good way to require 'entire field'.
The closest would probably be to use index_field_lengths; the field length can then be used in a custom ranking expression. But if there are multiple fields in your index, it will be very tricky to implement.
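If handling it inside Sphinx gets too awkward, a different option (not what is suggested above, just a client-side fallback) is to run the loose query and then keep only rows whose field has exactly the query's words. A minimal sketch, assuming simple whitespace tokenization and a hypothetical same_words helper:

```python
from collections import Counter


def same_words(field_text, query):
    """True if the field contains exactly the query's words, in any order."""
    return Counter(field_text.lower().split()) == Counter(query.lower().split())


candidates = ["Word1 Word2", "Word2 Word1", "Word3 Word2 Word1"]
print([text for text in candidates if same_words(text, "Word1 Word2")])
```

This compares raw words, so it will disagree with the index wherever Sphinx applies stemming, charset folding or stopwords.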

Regular expression repetition: how to match expressions of variable lengths?

Essentially, here's what I want to do:
if ($expression =~ /^\d{num}\w{num}$/)
{
#doSomething
}
where num is not an identifier, but could stand for any integer greater than 0 (\d and \w were arbitrarily chosen). I want to match a string iff it contains two groups of related characters, one group immediately followed by the other, and the number of characters in each group is the same.
For this example, 123abc and 021202abcdef would match, but 43abc would not, neither would 12ab3c or 1234acbcde.
Don’t think of the string as growing from left to right, but rather from the outside in:
xy
x(xy)y
xx(xy)yy
Your regex would then be something like:
/^(x(?1)?y)$/
Where (?1) is a reference to the outer pair of parentheses. ? makes it optional in order to give a “base case” of sorts to the recursive match. This is probably the simplest example of how regexes can be used to match context-free grammars—though it’s generally easier to get right with a parser generator or parser combinator library.
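The same outside-in idea can be written as an ordinary recursive function, which is handy in languages whose regex engine has no recursion (Python's standard re module, for instance, does not support (?1)). A minimal sketch for the literal x/y case; is_nested is a made-up name.

```python
def is_nested(s, opener="x", closer="y"):
    """True if s is n openers followed by n closers, for some n >= 1."""
    if s == opener + closer:
        return True  # base case: the innermost "xy"
    if len(s) > 2 and s.startswith(opener) and s.endswith(closer):
        return is_nested(s[1:-1], opener, closer)  # peel one outer pair
    return False


for sample in ["xy", "xxyy", "xxxyyy", "xxy", "xyxy", ""]:
    print(repr(sample), is_nested(sample))
```

Each call strips one outer pair, exactly like one level of the recursive group, and the "xy" check is the base case that the optional (?1)? provides in the regex.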
Well, there's
if ($expression =~ /^(\d+)([[:alpha:]]+)$/ && length($1)==length($2))
{
#doSomething
}
A regex isn't always the best option.
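The capture-then-compare-lengths approach translates directly to other languages. A Python sketch using ASCII digit and letter classes (an assumption that matches the examples in the question; adjust the classes for other character groups):

```python
import re

PATTERN = re.compile(r"([0-9]+)([A-Za-z]+)")


def equal_groups(s):
    """True if s is a run of digits followed by an equally long run of letters."""
    m = PATTERN.fullmatch(s)
    return m is not None and len(m.group(1)) == len(m.group(2))


for sample in ["123abc", "021202abcdef", "43abc", "12ab3c", "1234acbcde"]:
    print(sample, equal_groups(sample))
```

The regex only checks the shape of the string; the length comparison is done in plain code, which is the point of the answer above.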