Can Sphinx query push Integer patterns? - sphinx

In regex for instance if I wanted to search on up to five digits in front of my term I'd do
[0-9]\d{0,5} (Term)
can I do this with Sphinx and if so how would I return the fact that I matched Term AND the specific integer as well?

Sphinx doesnt have a digit only wildcard.
There are wildcards,
http://sphinxsearch.com/docs/current.html#conf-min-infix-len
but they match any char in charset_table.

Related

perl regex - pattern matching

Can anyone explain what is being done below?
$name=~m,common/([^/]+)/run.*/([^/]+)/([^/]+)$,;
common, run and / are match themselves.
() captures.
[^/]+ matches 1 or more characters that aren't /.
.* matches 0 or more characters that aren't Line Feeds.[1]
$ is equivalent to (\n?\z).[2]
\n optionally matches a Line Feed.
\z matches the end of the string.
I think it's trying to match a path of one or both of the following forms:
.../common/XXX/runYYY/XXX/XXX
common/XXX/runYYY/XXX/XXX
Where
XXX is a sequence of at least one character that doesn't contain /.
YYY is a sequence of any number of characters (incl zero) that doesn't contain /.
It matches more than that, however.
It matches uncommon/XXX/runYYY/XXX/XXX
It matches common/XXX/runYYY/XXX/XXX/XXX/XXX/XXX/XXX
The parts in bold are captured (available to the caller).
When the s flag isn't used.
When the m flag isn't used.

Text file search for match strings regex

I am trying to understand how regex works and what are the possibilities of working with it.
So I have a txt file and I am trying to search for 8 char long strings containing numbers. for now I use a quite simple option:
clear
Get-ChildItem random.txt | Select-String -Pattern [0-9][a-z] | foreach {$_.line}
It sort of works but I am trying to find a better option. ATM it takes too long to read through the left out text since it writes entire lines and it does not filter them by length.
You can use a lookahead to assert that a string contains at least 1 digit, then specify the length of the match and finally anchor it with ^ (start of string) and $ (end of string) if the string is on a line of its own, or \b (word boundary) if it's part of an HTML document as your comments seem to suggest:
Get-ChildItem C:\files\ |Select-String -Pattern '^(?=.*\d)\w{8}$'
Get-ChildItem C:\files\ |Select-String -Pattern '\b(?=.*\d)\w{8}\b'
The pattern [0-9][a-z] matches a digit followed by a letter. If you want to match a sequence of 8 characters use .{8}. The dot in regular expressions matches any character except newlines. A number in curly brackets matches the preceding expression the given number of times.
If you want to match non-whitespace characters use \S instead of .. If you want to match only digits and letters use [0-9a-z] (a character class) instead of ..
For a more thorough introduction please go find a tutorial. The subject is way too complex to be covered by a single answer on SO.
What you're currently searching for is a single number ranging from 0-9 followed by a single lowercase letter ranging from a-z.
this, for example, will match any 8 char long strings containing only alphanumeric characters.
\w{8}
i often forget what some regex classes are, and it may be useful to you as a learning tool, but i use this as a point of reference: http://regexr.com/
It can also validate what you're typing inline via a text field so you can see if what you're doing works or not.
If you need more of a tutorial than a reference, i found this extremely useful when i learned: regexone.com

Getting options into a Perl program

I've been googling for some time now and strangely didn't find anything that answers my question.
I want to pass -n as an option to the program, where n is an integer.
This is what I have:
if($ARGV[0] eq "-A_NUMBER")
Is there some type of wildcard I can use for this? So the condition is true for any minus sign followed by any integer (or character)?
I'd reccomend having a read about Getopt::Std it is more long winded than just inspecting #ARGV, but more robust
Use
if ($ARGV[0] =~ /^-\d/)
This regular expression matches a minus sign then a number.
"^" anchors the match to the start of the line and "\d" is an escape character that represents a number

I want to find and replace an ordered list in word from the . to the )

I have tried [0-9] and checked the use wildcard box but it replaces the individual numbers with the literal [0-9] string. How do I replace with the number it found plus a character?
Backreferences. Your unspecified environment may or may not support them, but if it does, you would:
replace \([0-9]*\)
with \1 <then, whatever the character you want is>

How to get a perfect match for a regexp pattern in Perl?

I've to match a regular-expression, stored in a variable:
#!/bin/env perl
use warnings;
use strict;
my $expr = qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx;
$str = "abcd[3] xyzg[4:0]";
if ($str =~ m/$expr/) {
print "\n%%%%%%%%% $`-----$&-----$'\n";
}
else {
print "\n********* NOT MATCHED\n";
}
But I'm getting the outout in $& as
%%%%%%%%% -----abcd[3] xyzg-----[4:0]
But expecting, it shouldn't go inside the if clause.
What is intended is:
if $str = "abcd xyzg" => %%%%%%%%% -----abcd xyzg----- (CORRECT)
if $str = "abcd[2] xyzg" => %%%%%%%%% -----abcd[2] xyzg----- (CORRECT)
if $str = "abcd[2] xyzg[3] => %%%%%%%%% -----abcd[2] xyzg[3]----- (CORRECT)
if $str = "abcd[2:0] xyzg[3] => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2:0] xyzg[3:0] => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2] xyzg[3:0]" => ********* NOT MATCHED (CORRECT/INTENDED)
but output is %%%%%%%%% -----abcd[2] xyzg-----[3:0] (WRONG)
OR better to say this is not intended.
In this case, it should/my_expectation go to the else block.
Even I don't know, why $& take a portion of the string (abcd[2] xyzg), and $' having [3:0]?
HOW?
It should match the full, not a part like the above. If it didn't, it shouldn't go to the if clause.
Can anyone please help me to change my $expr pattern, so that I can have what is intended?
By default, Perl regexes only look for a matching substring of the given string. In order to force comparison against the entire string, you need to indicate that the regex begins at the beginning of the string and ends at the end by using ^ and $:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)$/;
(Also, there's no reason to have the /x modifier, as your regex doesn't include any literal whitespace or # characters, and there's no reason for the /s modifier, as you're not using ..)
EDIT: If you don't want the regex to match against the entire string, but you want it to reject anything in which the matching portion is followed by something like "[0:0]", the simplest way would be to use lookahead:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\]|(?=[^[\w])|$ ))/x;
This will match anything that takes the following form:
beginning of the string (which your example in the comments seems to imply you want)
zero or more whitespace characters
one or more word characters
optional: [, one or more digits, ]
one or more whitespace characters
one or more word characters
one of the following, in descending order of preference:
[, one or more digits, ]
an empty string followed by (but not including!) a character that is neither [ nor a word character (The exclusion of word characters is to keep the regex engine from succeeding on "a[0] bc[1:2]" by only matching "a[0] b".)
end of string (A space is needed after the $ to keep it from merging with the following ) to form the name of a special variable, and this entails the reintroduction of the /x option.)
Do you have any more unstated requirements that need to be satisfied?
The short answer is your regexp is wrong.
We can't fix it for you without you explaining what you need exactly, and the community is not going to write a regexp exactly for your purpose because that's just too localized a question that only helps you this one time.
You need to ask something more general about regexps that we can explain to you, that will help you fix your regexp, and help others fix theirs.
Here's my general answer when you're having trouble testing your regexp. Use a regexp tool, like the regex buddy one.
So I'm going to give a specific answer about what you're overlooking here:
Let's make this example smaller:
Your pattern is a(bc+d)?. It will match: abcd abccd etc. While it will not match bcd nor bzd in the case of abzd it will match as matching only a because the whole group of bc+d is optional. Similarly it will match abcbcd as a dropping the whole optional group that couldn't be matched (at the second b).
Regexps will match as much of the string as they can and return a true match when they can match something and have satisfied the entire pattern. If you make something optional, they will leave it out when they have to including it only when it's present and matches.
Here's what you tried:
qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx
First, s and x aren't needed modifiers here.
Second, this regex can match:
Any or no whitespace followed by
a word of at least one alpha character followed by
optionally a grouped square bracketed number with at least one digit (eg [0] or [9999]) followed by
at least one white space followed by
a word of at least one alpha character followed by
optionally a square bracketed number with at least one digit.
Clearly when you ask it to match abcd[0] xyzg[0:4] the colon ends the \d+ pattern but doesn't satisfy the \] so it backtracks the whole group, and then happily finds the group was optional. So by not matching the last optional group, your pattern has matched successfully.