Why does this line stop Sphinx search?

I use the query sanitizing from barryhunter's example.
But when I add this line:
$q = preg_replace('/[^\w~\|\(\)\^\$\?"\/=-]+/',' ',trim(strtolower($q)));
searching in Russian stops working. Only English works.
What is the reason? How should I sanitize the query?
Here is my code:
<HTML>
<BODY>
<form action="" method="get">
<input name="q" size="40" value="<?php echo @$_GET['q']; ?>" />
<input type="submit" value="Search" />
</form>
<?php
require ( 'sphinxapi.php' );
$sphinx = new SphinxClient;
$sphinx->SetServer('ununtu', 9312);
$sphinx->open();
$sphinx->SetMatchMode (SPH_MATCH_EXTENDED);
$sphinx->setFieldWeights(array(
'title' => 10,
'content' => 5
));
$sphinx->SetRankingMode(SPH_RANK_WORDCOUNT);
$sphinx->SetSortMode(SPH_SORT_RELEVANCE);
$sphinx->setLimits(0, 10, 200);
$sphinx->resetFilters();
$q = isset($_GET['q'])?$_GET['q']:'';
$q = preg_replace('/ OR /',' | ',$q);
// $q = preg_replace('/[^\w~\|\(\)\^\$\?"\/=-]+/',' ',trim(strtolower($q)));
if(isset($_GET['q']) and strlen($_GET['q']) > 1)
{
$result = $sphinx->query($sphinx->escapeString($q), '*');
...

Assuming your input string is UTF-8 encoded, you are running preg_replace in non-Unicode mode. Add the u modifier at the end of the pattern, e.g.:
$q = preg_replace('/[^\w~\|\(\)\^\$\?"\/=-]+/u',' ',trim(strtolower($q)));

Specifically, that regex strips anything that is not a 'word' character or one of a predefined list of syntax/punctuation characters.
The PCRE definition of a word character (the \w) is:
A "word" character is any letter or digit or the underscore character,
that is, any character which can be part of a Perl "word". The
definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place. For
example, in the "fr" (French) locale, some character codes greater
than 128 are used for accented letters, and these are matched by \w.
http://php.net/manual/en/regexp.reference.escape.php
So in an English locale (or another western European one, for example), many Russian characters are possibly not considered word characters, and so they are stripped.
(If your pages are in UTF-8, you may also need the /u modifier, as mentioned in the other answer.)
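The effect is easy to reproduce outside PHP. The sketch below is JavaScript, not PHP, so treat it as an analogy only: JavaScript's \w is always ASCII-only, which mimics PCRE without the u modifier, while \p{L} and \p{N} under the u flag stand in for a Unicode-aware word class. The function names sanitizeAscii and sanitizeUnicode are made up for illustration.

```javascript
// Syntax characters kept by the sanitizer: ~ | ( ) ^ $ ? " / = -

// ASCII-only: \w covers [A-Za-z0-9_] and nothing else, so Cyrillic is stripped.
function sanitizeAscii(q) {
  return q.trim().toLowerCase().replace(/[^\w~|()^$?"\/=-]+/g, " ");
}

// Unicode-aware: \p{L} (letters) and \p{N} (numbers) require the u flag.
function sanitizeUnicode(q) {
  return q.trim().toLowerCase().replace(/[^\p{L}\p{N}_~|()^$?"\/=-]+/gu, " ");
}

console.log(JSON.stringify(sanitizeAscii("Привет world")));   // " world"
console.log(JSON.stringify(sanitizeUnicode("Привет world"))); // "привет world"
```

Note that toLowerCase here is Unicode-aware. On the PHP side, mb_strtolower is the multibyte counterpart of strtolower, which may be worth checking as well if lowercasing Russian text misbehaves.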

Related

Google forms Regular Expressions

I'm creating a survey in Google Forms and can't find a regular expression for a PIN code entry.
The user is asked a question and can enter two PIN codes in two text fields.
I need a regular expression that matches exactly 4 digits, each from 0 to 9.
Example:
Textbox1: 1234
Textbox2: 4321
Any ideas?
Try \d{4}
Also set your Regular Expression to Matches
[0-9]{4}
This should be your regular expression.
function validate() {
var textField = document.getElementById("textbox1").value;
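// Anchored so the whole value must be four digits; no g flag, because test() with g keeps state between calls
var regex = /^[0-9]{4}$/;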
alert("Valid input: " + regex.test(textField));
}
<input type="text" id="textbox1">
<input type="button" onclick="validate()" value="Validate">
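For validation outside the form, a minimal sketch of whole-value matching might look like this. The helper name isValidPin is hypothetical. The point it illustrates is the anchoring: an unanchored [0-9]{4} only has to match somewhere in the input, which is presumably why the answer above also recommends the "Matches" setting.

```javascript
// Returns true only when the whole value is exactly four digits 0-9.
// Without ^ and $, inputs such as "12345" or "ab1234" would also pass,
// because the pattern would only need to match somewhere in the string.
function isValidPin(value) {
  return /^[0-9]{4}$/.test(value);
}

console.log(isValidPin("1234"));  // true
console.log(isValidPin("12345")); // false
console.log(isValidPin("12a4"));  // false
```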

Is there a function to decode encoded unicode utf-8 string like from a form?

I want to store some data with a html form and Rebol cgi. My form looks like this:
<form action="test.cgi" method="post" >
Input:
<input type="text" name="field"/>
<input type="submit" value="Submit" />
</form>
But for unicode characters like Chinese, I get the encoded form of the data with percent signs, for instance %E4%BA%BA.
(This is for the Chinese character "人" ... its UTF-8 form as a Rebol binary literal is #{E4BABA})
Is there a function in the system, or an existing library that can decode this directly? dehex does not appear to currently cover this case. I'm currently decoding this manually by removing the percent signs and constructing the corresponding binary, like this:
data: to-string read system/ports/input
print data
;-- this prints "field=%E4%BA%BA"
k-v: parse data "="
print k-v
;-- this prints ["field" "%E4%BA%BA"]
v: append insert replace/all k-v/2 "%" "" "#{" "}"
print v
;-- This prints "#{E4BABA}" ... a string!, not binary!
;-- LOAD will help construct the corresponding binary
;-- then TO-STRING will decode that binary from UTF-8 to character codepoints
write %test.txt to-string load v
I have a library called AltWebForm that en/decodes percent-encoded web form data:
do http://reb4.me/r3/altwebform
load-webform "field=%E4%BA%BA"
The library is described here: Rebol and Web Forms.
Looks to be related to ticket #1986, where it is discussed whether this is a "bug" or the Internet changing out from under its own spec:
Have DEHEX convert UTF-8 sequences from browsers as Unicode.
If you have specific experience on what has become standard in Chinese, and want to weigh in, that would be valuable.
Just as an aside, the specific case above could have been handled in PARSE alternately as:
key-value: {field=%E4%BA%BA}
utf8-bytes: copy #{}
either parse key-value [
copy field-name to {=}
skip
some [
and {%}
copy enhexed-byte 3 skip (
append utf8-bytes dehex enhexed-byte
)
]
] [
print [field-name {is} to string! utf8-bytes]
] [
print {Malformed input.}
]
That will output:
field is 人
With some comments included:
key-value: {field=%E4%BA%BA}
;-- Generate empty binary value by copying an empty binary literal
utf8-bytes: copy #{}
either parse key-value [
;-- grab field-name as the chars right up to the equals sign
copy field-name to {=}
;-- skip the equal sign as we went up to it, without moving "past" it
skip
;-- apply the enclosed rule SOME (non-zero) number of times
some [
;-- match a percent sign as the immediate next symbol, without
;-- advancing the parse position
and {%}
;-- grab the next three chars, starting with %, into enhexed-byte
copy enhexed-byte 3 skip (
;-- If we get to this point in the match rule, this parenthesized
;-- expression lets us evaluate non-dialected Rebol code to
;-- append the dehexed byte to our utf8 binary
append utf8-bytes dehex enhexed-byte
)
]
] [
print [field-name {is} to string! utf8-bytes]
] [
print {Malformed input.}
]
(Note also that "simple parse" is getting the axe in favor of enhancements to SPLIT. So writing code like parse data "=" can now be expressed instead as split data "=", or other cool variants if you check them out...samples are in the ticket.)
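For comparison, here is the same byte-collecting idea sketched in JavaScript rather than Rebol: gather the raw bytes first, then decode the whole sequence as UTF-8. The function name percentDecode is made up, and the sketch assumes everything outside the %XX escapes is plain ASCII (it does not handle "+" as a space, which real form data would need).

```javascript
// Decode one percent-encoded value by hand: collect the raw bytes first,
// then interpret the whole byte sequence as UTF-8.
function percentDecode(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    const hex = encoded.slice(i + 1, i + 3);
    if (encoded[i] === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // "%E4" becomes the byte 0xE4
      i += 2;                        // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i)); // assumption: plain ASCII character
    }
  }
  return new TextDecoder("utf-8").decode(new Uint8Array(bytes));
}

const [fieldName, rawValue] = "field=%E4%BA%BA".split("=");
console.log(fieldName, "is", percentDecode(rawValue)); // field is 人

// The built-in decoder gives the same result for well-formed input.
console.log(decodeURIComponent("%E4%BA%BA")); // 人
```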

Multilingual text sorting in Perl, on Windows, using locale

I am building a piece of software for sorting book indexes in different languages. It uses Perl, and keys off of the locale. I am developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or by relying on locale, am I barking up the wrong tree? Bottom line, Windows is really where I need this to work, but I am more comfortable developing in my UNIX environment.
Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, it is easy to use the Unicode::Collate module as a starting point.
If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.
Decoding into Unicode
If you run in an all-UTF8 environment, this is easy, but if you are subject to the vicissitudes of random so-called “locales” (or even worse, the ugly things Microsoft calls “code pages”), then you might want to get the CPAN Encode::Locale module to help you out. For example:
use Encode;
use Encode::Locale;
# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_) } @ARGV;
# or as a stream for binmode or open
binmode $some_fh, ":encoding(locale)";
binmode STDIN, ":encoding(console_in)" if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;
(If it were me, I would just use ":utf8" for the output.)
Standard Collation, plus locales and tailoring
The point is, once you have everything decoded into internal Perl format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:
use v5.14;
use utf8;
use Unicode::Collate;
my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
@exes = Unicode::Collate->new->sort(@exes);
say "@exes";
# prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.
my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x; # strip articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg; # zero-pad numbers
        return $_;
    },
);
Now just use that object’s sort method to sort with.
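To make the preprocessing step concrete, here is a rough JavaScript sketch of the same two transformations: strip a leading article, zero-pad the numbers. It is not Unicode::Collate. The helper titleKey is hypothetical, and a plain string comparison stands in for the collator.

```javascript
// Build a sort key for a book title: drop a leading article and
// zero-pad every run of digits so that "Part 2" sorts before "Part 10".
function titleKey(title) {
  return title
    .replace(/^(?:The|An?)\s+/, "")               // strip a leading article
    .replace(/\d+/g, (n) => n.padStart(20, "0")); // pad numbers to 20 digits
}

const titles = ["The Hobbit", "Part 10", "A Study in Scarlet", "Part 2"];
const sorted = [...titles].sort((a, b) => {
  const ka = titleKey(a);
  const kb = titleKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0; // plain code-unit comparison
});

console.log(sorted.join(" | "));
// The Hobbit | Part 2 | Part 10 | A Study in Scarlet
```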
Sometimes you need to turn the sort inside out. For example:
my $collator = Unicode::Collate->new();
for my $rec (@recs) {
    $rec->{NAME_key} =
        $collator->getSortKey( $rec->{NAME} );
}
@srecs = sort {
    $b->{AGE} <=> $a->{AGE}
        ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;
The reason you have to do that is because you are sorting on a record with various fields. The binary sort key allows you to use the cmp operator on data that has been through your chosen/custom collator object.
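The same multi-field ordering (age descending, then name) can be sketched in JavaScript. This is only an analogy: Intl.Collator exposes no sort-key method comparable to getSortKey, so the sketch calls compare() directly for the tie-break. The record fields and the "en" locale are assumptions chosen for the illustration.

```javascript
// Sort records by age (descending), breaking ties by name with a collator.
// Intl.Collator has no sort-key method, so compare() is called directly.
const collator = new Intl.Collator("en");

const recs = [
  { name: "bob", age: 30 },
  { name: "Alice", age: 30 },
  { name: "carol", age: 41 },
];

// b.age - a.age is 0 for equal ages, so || falls through to the name comparison.
const srecs = [...recs].sort(
  (a, b) => b.age - a.age || collator.compare(a.name, b.name)
);

console.log(srecs.map((r) => r.name).join(", ")); // carol, Alice, bob
```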
The full constructor for the collator object has all this for a formal syntax:
$Collator = Unicode::Collate->new(
UCA_Version => $UCA_Version,
alternate => $alternate, # alias for 'variable'
backwards => $levelNumber, # or \@levelNumbers
entry => $element,
hangul_terminator => $term_primary_weight,
highestFFFF => $bool,
identical => $bool,
ignoreName => qr/$ignoreName/,
ignoreChar => qr/$ignoreChar/,
ignore_level2 => $bool,
katakana_before_hiragana => $bool,
level => $collationLevel,
minimalFFFE => $bool,
normalization => $normalization_form,
overrideCJK => \&overrideCJK,
overrideHangul => \&overrideHangul,
preprocess => \&preprocess,
rearrange => \@charList,
rewrite => \&rewrite,
suppress => \@charList,
table => $filename,
undefName => qr/$undefName/,
undefChar => qr/$undefChar/,
upper_before_lower => $bool,
variable => $variable,
);
But you usually don’t have to worry about most of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.
use Unicode::Collate::Locale;
$coll = Unicode::Collate::Locale->
new(locale => "fr");
@french_text = $coll->sort(@french_text);
See how easy that is?
But you can do other cool things, too.
use Unicode::Collate::Locale;
my $Collator = new Unicode::Collate::Locale::
locale => "de__phonebook",
level => 1,
normalization => undef,
;
my $full = "Ich müß Perl studieren.";
my $sub = "MUESS";
if (my ($pos,$len) = $Collator->index($full, $sub)) {
my $match = substr($full, $pos, $len);
say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
}
When run, that says:
Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›
Here are the available locales as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:
locale name       description
--------------------------------------------------------------
af                Afrikaans
ar                Arabic
as                Assamese
az                Azerbaijani (Azeri)
be                Belarusian
bg                Bulgarian
bn                Bengali
bs                Bosnian
bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
ca                Catalan
cs                Czech
cy                Welsh
da                Danish
de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
ee                Ewe
eo                Esperanto
es                Spanish
es__traditional   Spanish ('ch' and 'll' as a grapheme)
et                Estonian
fa                Persian
fi                Finnish (v and w are primary equal)
fi__phonebook     Finnish (v and w as separate characters)
fil               Filipino
fo                Faroese
fr                French
gu                Gujarati
ha                Hausa
haw               Hawaiian
hi                Hindi
hr                Croatian
hu                Hungarian
hy                Armenian
ig                Igbo
is                Icelandic
ja                Japanese [1]
kk                Kazakh
kl                Kalaallisut
kn                Kannada
ko                Korean [2]
kok               Konkani
ln                Lingala
lt                Lithuanian
lv                Latvian
mk                Macedonian
ml                Malayalam
mr                Marathi
mt                Maltese
nb                Norwegian Bokmal
nn                Norwegian Nynorsk
nso               Northern Sotho
om                Oromo
or                Oriya
pa                Punjabi
pl                Polish
ro                Romanian
ru                Russian
sa                Sanskrit
se                Northern Sami
si                Sinhala
si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
sk                Slovak
sl                Slovenian
sq                Albanian
sr                Serbian
sr_Latn           Serbian in Latin (tailored as Croatian)
sv                Swedish (v and w are primary equal)
sv__reformed      Swedish (v and w as separate characters)
ta                Tamil
te                Telugu
th                Thai
tn                Tswana
to                Tonga
tr                Turkish
uk                Ukrainian
ur                Urdu
vi                Vietnamese
wae               Walser
wo                Wolof
yo                Yoruba
zh                Chinese
zh__big5han       Chinese (ideographs: big5 order)
zh__gb2312han     Chinese (ideographs: GB-2312 order)
zh__pinyin        Chinese (ideographs: pinyin order) [3]
zh__stroke        Chinese (ideographs: stroke order) [3]
zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
(Zulu).
Note
[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular form. The
difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
and then "katakana_before_hiragana" has no effect.
[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
(level 2) greater than, the corresponding hangul syllable.
[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.
Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.
So in summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that doesn’t rely on random settings of the user’s console window for correct behavior.
Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by kind permission of its author. :)
Win32::OLE::NLS gives you access to that part of the system. It provides CompareString and the tools needed to obtain the required locale ID.
In case you want/need to locate the system documentation, the underlying system call is named CompareStringEx.

sed/awk: capitalize everything between patterns and lowercase small words

I found a way to capitalize the whole document with both sed and awk, but how do I do it if I only want to convert the text inside certain patterns from ALL CAPS to Title Case?
For example, I have an HTML file, and everything (multiple occurrences) between <b> and </b> has to be converted from TITLE to Title, if possible leaving small words (1 or 2 letters) in lowercase.
From This:
<div id="1">
<div class="p"><b>THIS IS A RANDOM TITLE</b></div>
<table class="hugetable">
...
</table>
<div class="p"><b>THIS IS ANOTHER RANDOM TITLE</b></div>
<table class="hugetable">
...
</table>
...
</div>
To this:
<div id="1">
<div class="p"><b>This is a Random Title</b></div>
<table class="hugetable">
...
</table>
<div class="p"><b>This is Another Random Title</b></div>
<table class="hugetable">
...
</table>
...
</div>
This is not the most beautiful solution but I think it works:
sed -r -e '/<b>/ {s/( .)([^ ]*)/\1\L\2/g}' -e 's/<b>(.)/<b>\u\1/' -e '/<b>/ {s/(\b.{1,2}\b)/\L\1/g}' data
Explanation:
1st expression (-e): If a line contains <b>:
Then for each word which has a space in front of it, keep the space and the first (already capitalized) character (\1) and then convert all the following characters of the word to lower case (\L\2)
2nd expression (-e): The first word after <b> is still uncapitalized, so select the first character after the bold tag <b>(.) and replace it uppercased <b>\u\1
3rd expression (-e): Again if a line contains <b>:
Then select words of 1 or 2 characters in length \b.{1,2}\b and replace them lowercased \L\1
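The three steps can also be expressed as one function, which may make the intent easier to follow. This is a JavaScript sketch, not a replacement for the sed command. The name titleCaseBold is made up, and it assumes each <b>...</b> pair sits on a single line, as in the sample input.

```javascript
// Title-case the text inside every <b>...</b> pair: lowercase everything,
// then capitalize words longer than two letters, plus the very first word.
function titleCaseBold(html) {
  return html.replace(/<b>(.*?)<\/b>/g, (_, text) => {
    const words = text.toLowerCase().split(" ");
    const cased = words.map((word, i) =>
      i === 0 || word.length > 2
        ? word.charAt(0).toUpperCase() + word.slice(1)
        : word
    );
    return "<b>" + cased.join(" ") + "</b>";
  });
}

console.log(titleCaseBold('<div class="p"><b>THIS IS A RANDOM TITLE</b></div>'));
// <div class="p"><b>This is a Random Title</b></div>
```

Unlike the sed version, this only touches the text between the tags, so attributes elsewhere on the line keep their original case.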

Clean string from html tags and special characters

I want to clean my text of HTML tags, HTML special characters, and characters like < > [ ] / \ * ,
I used $str = preg_replace("/&#?[a-zA-Z0-9]+;/i", "", $str);
It works well for HTML special characters, but some characters are not removed, such as:
( /*/*]]>*/ )
How can I remove these characters?
If you are really using PHP, as it looks like, you can just use:
$str = htmlspecialchars($str);
All HTML chars will be escaped (which could be better than just stripping them). If you really want just to filter these characters, what you need to do is escape those characters on the chars list:
$str = preg_replace("/[\&#\?\]\[\/\\\<\>\*\:\(\);]*/i","",$str);
Notice there is just one character class, "/[...]*/i"; I removed a-zA-Z0-9 because you presumably want to keep those characters. Alternatively, you can whitelist only the characters you want to allow in your string (this gets awkward with accented characters like á é ü, because you have to list every accepted character):
$str = preg_replace("/[^a-zA-Z0-9áÁéÉíÍãÃüÜõÕñÑ\.\+\-\_\%\$\#\!\=;]*/","",$str);
Notice also that escaping more characters than necessary does no harm, except in ranges: a-z is a range, whereas a\-z matches only a, -, or z.
I hope it helps. :)
A regular expression for HTML tags is:
/<[^>]*>/
so use something like this:
// The regular expression to remove HTML tags
// ([^>]* stops at the first >, so each tag is matched separately)
$htmltagsregex = '/<[^>]*>/';
// what each tag will be replaced with
$nothing = '';
// the string I want to apply it to
$string = 'this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>';
// do it
$result = preg_replace($htmltagsregex, $nothing, $string);
and it will return
this is a string with HTML tags that I want to remove
That's all
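One caveat worth seeing in action: a greedy pattern such as <.*> does not stop at the end of the first tag. The JavaScript sketch below compares a greedy pattern with a per-tag one on the same sample string. It only illustrates the regex behaviour and is not an HTML parser; in PHP itself the built-in strip_tags() may be the simpler choice.

```javascript
const input =
  "this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>";

// Greedy: .* runs to the LAST ">" in the string, swallowing the text in between.
const greedy = input.replace(/<(.*)?>/g, "");

// Per-tag: [^>]* cannot cross a ">", so each tag is matched on its own.
const perTag = input.replace(/<[^>]*>/g, "");

console.log(JSON.stringify(greedy)); // "this is a string with "
console.log(JSON.stringify(perTag)); // "this is a string with HTML tags that I want to remove"
```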