Searching for special characters as spaces in Sphinx

I have a problem with Sphinx search.
I index the string
xyz a'qwerty
and I need to find it with each of the following queries:
xy - ok
xy a - ok
xyz a'qwerty - ok
xyz a qwerty - ok
xyz a qwe - not ok
I really can't get the last query to match. Does anyone know how to do this?
My index looks like this. The regexp_filter lines were experiments and can be removed.
index ProductSearch
{
source = ProductSearchSource
path = c:/wamp/sphinx/data/product
docinfo = extern
enable_star = 0
expand_keywords = 1
min_word_len = 2
min_prefix_len = 1
charset_type = utf-8
charset_table = 0..9, A..Z->a..z, _, a..z, U+0022, U+0026, U+0027, U+0060, U+00B4, U+002E, U+0e1->a, U+0c1->a, U+10d->c, U+10c->c, U+10f->d, U+10e->d, U+0e9->e, U+0c9->e, U+11b->e, U+11a->e, U+0ed->i, U+0cd->i, U+148->n, U+147->n, U+0f3->o, U+0d3->o, U+159->r, U+158->r, U+161->s, U+160->s, U+165->t, U+164->t, U+0fa->u, U+0da->u, U+16f->u, U+16e->u, U+0fd->y, U+0dd->y, U+17e->z, U+17d->z,
wordforms = c:/wamp/www/project/configs/sphinx/synonyms
regexp_filter = (\w*)'(\w*) => \1'\2
regexp_filter = (\w*)'(\w*) => \1 \2
regexp_filter = (\w*)'(\w*) => \1
regexp_filter = (\w*)'(\w*) => \2
}
I am using SPH_MATCH_EXTENDED2.
P.S.: Sorry for my bad English.

Problem solved. I had overlooked an entry in the wordforms (synonyms) file that rewrote the word I was testing with, so it only looked as if Sphinx was not working correctly.

MATLAB filename separation

I have a filename "name" such as '10_m1_m2_const_m1_waves_20_90_m2_waves_90_20_20200312_213048', and I need to separate
'10_m1_m2_const_m1_waves_20_90_m2_waves_90_20' from '20200312_213048'.
name_sep = split(name,"_");
sep = '_';
name_join=[name_sep{1,1} sep name_sep{2,1} sep .....];
This does not work, because the number of "_" characters varies.
I need the two parts in order to move a file:
movefile([confpath,name(without 20200312_213048),'.config'],[name(without 20200312_213048), filesep, name, '.config']);
Do you have any idea? Thank you!
Maybe you can try regexp to find the starting position for the separation:
ind = regexp(name,'_\d+_\d+$');
name1 = name(1:ind-1);
name2 = name(ind+1:end);
such that
name1 = 10_m1_m2_const_m1_waves_20_90_m2_waves_90_20
name2 = 20200312_213048
Or use the 'tokens' option to capture both parts at once:
name_sep = regexp(name,'(.*)_(\d+_\d+)$','tokens','once');
which gives
name_sep =
{
[1,1] = 10_m1_m2_const_m1_waves_20_90_m2_waves_90_20
[1,2] = 20200312_213048
}
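For comparison, the same end-anchored pattern can be sketched in Python. This is only an illustration: the helper name split_timestamp is made up here, and it assumes the name always ends in two digit groups joined by an underscore.

```python
import re

def split_timestamp(name):
    """Split 'prefix_DATE_TIME' into (prefix, 'DATE_TIME').

    split_timestamp is a hypothetical helper. It assumes the name ends
    with two digit groups joined by an underscore.
    """
    # Greedy (.*) takes as much as possible while the rest still matches
    # the trailing "_digits_digits" at the end of the string.
    match = re.match(r"(.*)_(\d+_\d+)$", name)
    if match is None:
        raise ValueError("no trailing timestamp in %r" % name)
    return match.group(1), match.group(2)

name = "10_m1_m2_const_m1_waves_20_90_m2_waves_90_20_20200312_213048"
prefix, stamp = split_timestamp(name)
print(prefix)  # 10_m1_m2_const_m1_waves_20_90_m2_waves_90_20
print(stamp)   # 20200312_213048
```

Raising on a non-matching name is a design choice for the sketch; returning the name unchanged would work just as well if some files have no timestamp.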
You can use strfind, for example if you have a key that is always present right before the point where you want to split the name:
nm = '10_m1_m2_const_m1_waves_20_90_m2_waves_90_20_20200312_213048';
key = 'waves_90_20_';
idx = strfind(nm,key) + length(key);
nm(idx:end)
Or, if you know how many _ characters are in the part that you want to keep:
idx = strfind(nm,'_');
nm(idx(end-1)+1:end)
In both cases, the result is:
'20200312_213048'
As long as the timestamp is always at the end of the string, you can use strfind and count backwards from the end of the string:
name = '10_m1_m2_const_m1_waves_20_90_m2_waves_90_20_20200312_213048';
udscr = strfind(name,'_');
name_date = name(udscr(end-1)+1:end)
name_meta = name(1:udscr(end-1)-1)
name_date =
'20200312_213048'
name_meta =
'10_m1_m2_const_m1_waves_20_90_m2_waves_90_20'
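The "count backwards from the end" idea maps onto a right-hand split in Python. The sketch below is illustrative only: the directory name in confpath is a placeholder, and the paths are merely built, not moved.

```python
import os

name = "10_m1_m2_const_m1_waves_20_90_m2_waves_90_20_20200312_213048"

# Split at most twice, starting from the right: the last two fields are
# the date and the time, everything before them stays in one piece.
meta, date, time = name.rsplit("_", 2)
stamp = date + "_" + time

confpath = "conf"  # hypothetical source directory, not from the question
src = os.path.join(confpath, meta + ".config")
dst = os.path.join(meta, name + ".config")

print(meta)   # 10_m1_m2_const_m1_waves_20_90_m2_waves_90_20
print(stamp)  # 20200312_213048
# shutil.move(src, dst) would then perform the actual move.
```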

sphinx: how to search for a phrase with wildcard?

I have a phrase "my name is bob" and want to match it by querying "my n".
What should my query look like, and what config do I need?
Setting min_prefix_len did not give the expected results.
I had min_word_len set to 2, but changing it to 1 did not help either.
expand_keywords 1/2 made no difference either.
Here's my index config:
index track
{
source = track
path = /var/lib/sphinx/track
min_word_len = 1
docinfo = extern
mlock = 1
morphology = none
expand_keywords = 1
}
The queries I tried:
"my n*"
"my n"*
my n
"my n" | my n*
"my n" | "my n*" | my n*
No matter what, I cannot match "my name ...".
min_word_len = 1
min_prefix_len = 1
expand_keywords = 0
You need min_prefix_len to enable wildcard searches, but you want expand_keywords off, because it adds wildcards to every keyword.
Then you can simply query:
"my n*"

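What a quoted phrase with a trailing-star keyword such as "my n*" is meant to match can be modelled with a small Python toy. This is not how Sphinx works internally; phrase_prefix_match is a hypothetical function that only mimics the matching rule (adjacent words, exact match unless the keyword ends in *).

```python
def phrase_prefix_match(query, text):
    """Toy model of a quoted phrase whose keywords may end in '*'.

    Hypothetical illustration only, not Sphinx's implementation.
    """
    terms = query.strip('"').lower().split()
    words = text.lower().split()

    def term_matches(term, word):
        # "n*" matches any word starting with "n"; "my" must match exactly.
        if term.endswith("*"):
            return word.startswith(term[:-1])
        return word == term

    # Slide a window over the text so the terms must be adjacent and in order.
    for start in range(len(words) - len(terms) + 1):
        window = words[start:start + len(terms)]
        if all(term_matches(t, w) for t, w in zip(terms, window)):
            return True
    return False

print(phrase_prefix_match('"my n*"', "my name is bob"))  # True
print(phrase_prefix_match('"my n"', "my name is bob"))   # False
```
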
Change a number arithmetically in a text file using perl

I have a bunch of numbers in a text file as follows (example):
r0 = 204
r1 = 205
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 205
r1 = 206
I would like to divide the number on every line starting with r0 = by 100, so that the line then reads
r0 = 2.04
I would like to do this for all lines with r0 and also for r1. Is there a way to do this in perl?
This is my attempt. It doesn't work, mainly because I've never used perl before, which is why I'm asking such a simple question.
#!/usr/bin/perl
$string= r0\s+=\s+\\(d+)
$num= $1/100
$num2= r0\s+=\s+\\$num
s/$string/$num2;
A one liner I could run from bash would be much better though. I know it'll involve the s/find/replace function but not sure how to specify the integer part
perl -pi -e 's#^(r[01]\s*=\s*)(\d+)$#$1.$2/100#e' filename
The options mean:
-p = Run the code in a loop that prints the modified input
-e = Execute the code in the first argument
-i = Replace the input file(s) with the output
The regular expression bits mean:
^ = beginning of line
r[01] = r0 or r1
\s*=\s* = any amount of whitespace, an =, and any amount of whitespace
\d+ = digits
$ = end of line
The replacement uses the e modifier, which means that it should be executed as a Perl expression. $1 and $2 are the contents of the two capture groups: $1 is everything before the number, $2 is the number. $2/100 divides the number by 100, and . concatenates the two pieces together.
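The same capture-and-compute substitution can be sketched in Python, where a replacement function plays the role of the e modifier. The names PATTERN and scale_line are made up for this illustration, and the sample text is a shortened copy of the input from the question.

```python
import re

# Same pattern as the Perl version; MULTILINE makes ^ and $ work per line.
PATTERN = re.compile(r"^(r[01]\s*=\s*)(\d+)$", re.MULTILINE)

def scale_line(match):
    # group(1) is everything before the number, group(2) is the number.
    return match.group(1) + str(int(match.group(2)) / 100)

text = "r0 = 204\nr1 = 205\nmax_gap = 20u\nmin = 0\n"
result = PATTERN.sub(scale_line, text)
print(result)
```

Lines such as max_gap = 20u and min = 0 are left alone, because the pattern only matches lines that start with r0 or r1.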
As a one-liner:
perl -pi -e 's{^r[01]\s*=\s*\K(\d+)$}{$1/100}e' filename.txt
Here is an awk solution:
awk '/^r[01]/ {$3/=100} 1' file
r0 = 2.04
r1 = 2.05
max_gap = 20u
min = 0
max = 8
thickness = 2
color = green
fill_under = yes
fill_color = green
r0 = 2.05
r1 = 2.06

Sphinx autocomplete search

I'm trying to do a Google-style autocomplete search with Sphinx and AJAX.
Say the user is looking for an iphone. The goal is that input like "ip", "iph", "ipho" gives me the result, but it does not, while "iphon" or "iphone" do.
So, what am I doing wrong here?
index product
{
source = product
path = /var/lib/sphinx/product
docinfo = extern
mlock = 0
morphology = stem_enru
min_word_len = 2
charset_type = utf-8
charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
min_prefix_len = 2
max_substring_len = 6
enable_star = 1
}
and the query
$sphinx = new SphinxClient();
$sphinx->SetLimits(0, 1500, 2500);
$sphinx->SetServer('localhost', 9312);
$sphinx->SetMatchMode(SPH_MATCH_EXTENDED);
$sphinx->SetSortMode(SPH_SORT_RELEVANCE);
$sphinx->SetFieldWeights(array ('name' => 30, 'brand' => 20, 'parent_name' => 10, 'description' => 5));
$result = $sphinx->Query($string, '*');

Renaming a Word document and saving its filename with its first 10 letters

I have recovered some Word documents from a corrupted hard drive using a piece of software called photorec. The problem is that the documents' names can't be recovered; they are all renamed by a sequence of numbers. There are over 2000 documents to sort through and I was wondering if I could rename them using some automated process.
Is there a script I could use to find the first 10 letters in the document and rename it with that? It would have to be able to cope with multiple documents having the same first 10 letters and so not write over documents with the same name. Also, it would have to avoid renaming the document with illegal characters (such as '?', '*', '/', etc.)
I only have a little bit of experience with Python, C, and even less with bash programming in Linux, so bear with me if I don't know exactly what I'm doing if I have to write a new script.
How about VBScript? Here is a sketch:
FolderName = "C:\Docs\"
Set fs = CreateObject("Scripting.FileSystemObject")
Set fldr = fs.GetFolder(FolderName)
Set ws = CreateObject("Word.Application")
For Each f In fldr.Files
    '' Skip Word's temporary lock files
    If Left(f.Name, 2) <> "~$" Then
        If InStr(f.Type, "Microsoft Word") Then
            '' MsgBox f.Name   '' uncomment to debug
            Set doc = ws.Documents.Open(FolderName & f.Name)
            '' Use the first 10 characters of the first paragraph that
            '' still has text after cleaning
            s = vbNullString
            i = 1
            Do While Trim(s) = vbNullString And i <= doc.Paragraphs.Count
                s = doc.Paragraphs(i).Range.Text
                s = CleanString(Left(s, 10))
                i = i + 1
            Loop
            doc.Close False
            If s = "" Then s = "NoParas"
            '' Extension including the dot, e.g. ".doc"
            ext = Mid(f.Name, InStrRev(f.Name, "."))
            '' Append a counter until the target name is unused
            s1 = s
            i = 1
            Do While fs.FileExists(FolderName & s1 & ext)
                s1 = s & i
                i = i + 1
            Loop
            '' MsgBox "Name " & FolderName & f.Name & " As " & FolderName & s1 & ext
            '' This uses copy, because it seems safer
            f.Copy FolderName & s1 & ext, False
            '' MoveFile would rename the file instead:
            '' fs.MoveFile FolderName & f.Name, FolderName & s1 & ext
        End If
    End If
Next
MsgBox "Done"
ws.Quit
Set ws = Nothing
Set fs = Nothing

Function CleanString(StringToClean)
    ''http://msdn.microsoft.com/en-us/library/ms974570.aspx
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.IgnoreCase = True
    objRegEx.Global = True
    ''Remove anything that is not a-z or 0-9
    objRegEx.Pattern = "[^a-z0-9]"
    CleanString = objRegEx.Replace(StringToClean, "")
End Function
Word documents are stored in a custom format which places a load of binary cruft on the beginning of the file.
The simplest thing would be to knock something up in Python that searched for the first line beginning with ASCII chars. Here you go:
#!/usr/bin/env python3
import glob
import os

for path in glob.glob("*.doc"):
    new_name = ""
    chars = 0
    found = False
    with open(path, "rb") as f:
        byte = f.read(1)
        while byte:
            code = byte[0]
            if 0 < code < 128:
                ch = chr(code)
                # keep letters and digits, replace everything else with "_"
                new_name += ch if ch.isalnum() else "_"
                chars += 1
                if chars == 100:
                    # found a run of 100 ASCII characters
                    found = True
                    break
            else:
                # a NUL or non-ASCII byte ends the run; start over
                new_name = ""
                chars = 0
            byte = f.read(1)
    if found:
        new_name = new_name[:20] + ".doc"
        print("renaming " + path + " to " + new_name)
        os.rename(path, new_name)
NOTE: if you want to glob multiple directories you'll need to change the glob line accordingly. Also this takes no account of whether the file you're trying to rename to already exists, so if you have multiple docs with the same first few chars then you'll need to handle that.
I found the first chunk of 100 ASCII chars in a row (if you look for less than that you end up picking up doc keywords and such) and then used the first 20 of these to make the new name, replacing anything that's not a-z A-Z or 0-9 with underscores to avoid file name issues.
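The name-collision handling mentioned in the note could be added along these lines. This is a sketch under assumptions: unique_name and its exists parameter are hypothetical and not part of the script above, and the underscore-plus-counter naming is just one possible scheme.

```python
import os

def unique_name(stem, ext, exists=os.path.exists):
    """Return stem+ext, or stem_1+ext, stem_2+ext, ... if that is taken.

    unique_name and its `exists` parameter are hypothetical helpers
    added for illustration; `exists` defaults to a real filesystem check.
    """
    candidate = stem + ext
    counter = 1
    while exists(candidate):
        candidate = "%s_%d%s" % (stem, counter, ext)
        counter += 1
    return candidate

# Demonstration with a fake set of existing files instead of the disk.
taken = {"Dear_Sir.doc", "Dear_Sir_1.doc"}
print(unique_name("Dear_Sir", ".doc", exists=taken.__contains__))  # Dear_Sir_2.doc
```

In the rename script, the result of unique_name(new_name[:20], ".doc") could be passed to os.rename in place of the fixed name.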