How I should tune Sphinx? - sphinx

I make my first FullTextSearch app. Today I at last began test.
'test'
... the Whig national **ticket** victorious, he ... Democrats who **thought** he was ...
... also a self-**tought** architect. He ... and he always **thought** of how ...
... HTML (Hyper **Text** Mark-up ... Сервер HTML Hyper **Text** Markup Language. ...
... ,0 2. Aiced **test** ratio (Quick ratio ...
... on the first **Tuesday**, after the first ...
'stop'
... ],Elm); OutText(Elm); **Stop**:=False; End; '2 ...
.. a crucial **step** in the ... is an increasingly **steep** maturity-related...
... CHIPSET FEATURES **SETUP** или INTEGRATED ... CHIPSET FEATURES **SETUP** или ...
... Trisetum, Anisantna, **Stipa** и ... многие виды **Stipa**, Stipagrostis), что ...
My config:
source src1
{
type = csvpipe
csvpipe_command = /usr/bin/php /var/www/html/import.php
csvpipe_field_string = title
csvpipe_field_string = content
csvpipe_attr_string = path
}
index test1
{ source = src1
path = /var/lib/sphinxsearch/data/test1
mlock = 0
# morphology = stem_en, stem_ru, soundex
min_word_len = 2
html_strip = 0
}
I commented the morphology string and reload Sphinx, but result the same. It looks like morphology still works for me.

Possibly the biggest thing to look at is
morphology = stem_en, stem_ru, soundex
Morphology is a very powerful function, in that it 'morphs' words going into the index (and in the query, so can match!), using various rules.
In your case you've enabled stemming, which 'normalizes' word endings, BUt auso have soundex, which is a 'sounds similar' algorithm. I believe designed for English, so dont know how well it does on Russian!
All of these means going to get 'similar' matches, not just exact word matches.
(test is just a word that happens to sound similar to other words, your hyper example, is a more unique sounding word)
Also realise its probably jsut a test script, but can pass multiple documents at once to buildExcerpts. So its should be MUCH more efficient, to compile the documents and call buildExcepts just once.
But even more interesting, is as you sourcing the the text from a sphinx attribute, can use the SNIPPETS() sphinx function (in setSelect()!) in the main query. SO you dont have to receive the full text, just to send back to sphinx. ie sphinx will fetch the text from attribute internally. even more efficient!

Related

Word - delete text between <> including tables

I'm trying to delete text between < and > that includes 2 tables. I can do text including multiple lines using wildcard search and replace using (\<)(*)(>)
but this doesn't work when the text includes tables. Any ideas? There are varying numbers of lines in the tables too.
The correct wildcard Find expression would be:
\<*\>
Nevertheless, your observation is correct: It won't find content that includes a table between the < and >. You would need to use two Find/Replace operations, one that uses the above expression then another that employs a loop, looking for
<
Then extending the found range until:
>
is encountered.
Got the solution: https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-msoffice_custom-mso_2016/delete-text-between-including-tables/51f09dcb-8c77-41d3-840c-e8e0545f313a?tm=1531844975462&auth=1
Dim rng As Range
Selection.HomeKey wdStory
With Selection.Find
Do While .Execute(findText:="<", Forward:=True, _
MatchWildcards:=False, Wrap:=wdFindStop, MatchCase:=True) = True
Set rng = Selection.Range
rng.End = ActiveDocument.Range.End
rng.End = rng.Start + InStr(rng, ">")
rng.Select
Selection.Delete
Loop
End With
End Sub

Adding specific letters to a string MATLAB

I'm a neuroscience/biomedical engineering major struggling with this whole MATLAB programming ordeal and so far, this website is the best teacher available to me right now. I am currently having trouble with one of my HW problems. What I need to do is take a phrase, find a specific word in it, then take a specific letter in it and increase that letter by the number indicated. In other words:
phrase = 'this homework is so hard'
word = 'so'
letter = 'o'
factor = 5
which should give me 'This homework is sooooo hard'
I got rid of my main error, though I really don;t know how. I exited MATLAB, then got back into it. Lo and behold, it magically worked.
function[out1] = textStretch(phrase, word, letter, stretch)
searchword= strfind(phrase, word);
searchletter strfind(hotdog, letter); %Looks for the letter in the word
add = (letter+stretch) %I was hoping this would take the letter and add to it, but that's not what it does
replace= strrep(phrase, word, add) %This would theoretically take the phrase, find the word and put in the new letter
out1 = replace
According to the teacher, the ones() function might be useful, and I have to concatenate strings, but if I can just find it in the string and replace it, why do I need to concatenate?
Since this is homework I won't write the whole thing out for you but you were on the right track with strfind.
a = strfind(phrase, word);
b = strfind(word, letter);
What does phrase(1:a) return? What does phrase(a+b:end) return?
Making some assumptions about why your teacher wants you to use ones:
What does num = double('o') return?
What does char(num) return? How about char([num num])?
You can concatenate strings like this:
out = [phrase(1:a),'ooooo',phrase(a+b:end)];
So all you really need to focus on is how to get a string which is letter repeated factor times.
If you wanted to use strrep instead you would need to give it the full word you are searching for and a copy of that word with the repeated letters in:
new_phrase = strrep(phrase, 'so', 'sooooo');
Again, the issue is how to get the 'sooooo' string.
See if this works for you -
phrase_split = regexp(phrase,'\s','Split'); %// Split into words as cells
wordr = cellstr(strrep(word,letter,letter(:,ones(1,factor))));%// Stretched word
phrase_split(strcmp(phrase_split,word)) = wordr;%//Put stretched word into place
out = strjoin(phrase_split) %// %// Output as the string cells joined together
Note: strjoin needs a recent MATLAB version, which if unavailable could be obtained from here.
Or you can just use a hack obtained from the m-file itself -
out = [repmat(sprintf(['%s', ' '], phrase_split{1:end-1}), ...
1, ~isscalar(phrase_split)), sprintf('%s', phrase_split{end})]
Sample run -
phrase =
this homework is so hard and so boring
word =
so
letter =
o
factor =
5
out =
this homework is sooooo hard and sooooo boring
So, just wrap the code into a function wrapper like this -
function out = textStretch(phrase, word, letter, factor)
Homework molded edit:
phrase = 'this homework is seriously hard'
word = 'seriously'
letter = 'r'
stretch = 6
out = phrase
stretched_word = letter(:,ones(1,stretch))
hotdog = strfind(phrase, word)
hotdog_st = strfind(word,letter)
start_ind = hotdog+hotdog_st-1
out(start_ind+stretch:end+stretch-1) = out(start_ind+1:end)
out(hotdog+hotdog_st-1:hotdog+hotdog_st-1+stretch-1) = stretched_word
Output -
out =
this homework is serrrrrriously hard
As again, use this syntax to convert to function -
function out = textStretch(phrase, word, letter, stretch)
Well Jessica first of all this is WRONG, but I am not here to give you the solution. Could you please just use it this way? This surely run.
function main_script()
phrase = 'this homework is so hard';
word = 'so';
letter = 'o';
factor = 5;
[flirty] = textStretchNEW(phrase, word, letter, factor)
end
function [flirty] = textStretchNEW(phrase, word, letter, stretch)
hotdog = strfind(phrase, word);
colddog = strfind(hotdog, letter);
add = letter + stretch;
hug = strrep(phrase, word, add);
flirty = hug
end

ignore spaces and cases MATLAB

diary_file = tempname();
diary(diary_file);
myFun();
diary('off');
output = fileread(diary_file);
I would like to search a string from output, but also to ignore spaces and upper/lower cases. Here is an example for what's in output:
the test : passed
number : 4
found = 'thetest:passed'
a = strfind(output,found )
How could I ignore spaces and cases from output?
Assuming you are not too worried about accidentally matching something like: 'thetEst:passed' here is what you can do:
Remove all spaces and only compare lower case
found = 'With spaces'
found = lower(found(found ~= ' '))
This will return
found =
withspaces
Of course you would also need to do this with each line of output.
Another way:
regexpi(output(~isspace(output)), found, 'match')
if output is a single string, or
regexpi(regexprep(output,'\s',''), found, 'match')
for the more general case (either class(output) == 'cell' or 'char').
Advantages:
Fast.
robust (ALL whitespace (not just spaces) is removed)
more flexible (you can return starting/ending indices of the match, tokenize, etc.)
will return original case of the match in output
Disadvantages:
more typing
less obvious (more documentation required)
will return original case of the match in output (yes, there's two sides to that coin)
That last point in both lists is easily forced to lower or uppercase using lower() or upper(), but if you want same-case, it's a bit more involved:
C = regexpi(output(~isspace(output)), found, 'match');
if ~isempty(C)
C = found; end
for single string, or
C = regexpi(regexprep(output, '\s', ''), found, 'match')
C(~cellfun('isempty', C)) = {found}
for the more general case.
You can use lower to convert everything to lowercase to solve your case problem. However ignoring whitespace like you want is a little trickier. It looks like you want to keep some spaces but not all, in which case you should split the string by whitespace and compare substrings piecemeal.
I'd advertise using regex, e.g. like this:
a = regexpi(output, 'the\s*test\s*:\s*passed');
If you don't care about the position where the match occurs but only if there's a match at all, removing all whitespaces would be a brute force, and somewhat nasty, possibility:
a = strfind(strrrep(output, ' ',''), found);

NET::LDAP FIlter with OR

In PERL, NET::LDAP, I'm trying to use-
my $psearch-$ldap->search(
base => $my_base,
attr => ["mail","employeeNumber","physicalDeliveryOfficeName"],
filter => "(&(mail=*)(!(employeeNumber=9*)) (&(physicalDeliveryOfficeName=100)) (|(physicalDeliveryOfficeName=274)))");
Saying "give me everyone with a mail entry, where employee number does not begin with 9 and the physicalDeliveryOfficeName is either 100 or 274".
I can get it to work using just 100 or using just 274 but I can't seem to figure out how to specify 100 OR 274.
I can't seem to find the correct filter string, ready pull my hair out... please help!!
I can't test this, but LDAP queries use prefix notation while we're use to using infix notation. Imagine if you want something that's either a dog or a cat. In infix notation, it would look something like this:
((ANIMAL = "cat") OR (ANIMAL = "dog"))
With prefix notation, the boolean operator goes at the beginning of the query:
(OR (ANIMAL = "cat") (ANIMAL = "dog"))
The advantage to prefixed notation comes when you do more than two checks per boolean. Here I'm looking for something that's either a cat, a dog or a wombat:
(OR (ANIMAL = "cat") (ANIMAL = "dog") (ANIMAL = "wombat"))
Notice that I only needed a single boolean operator in the front of my statement. This will OR together all three statements. With our standard infix notation, I would have to have a second OR operator:
((ANIMAL = "cat") OR (ANIMAL = "dog") OR (ANIMAL = "wombat"))
Prefix notation was created by a Polish Mathematician named Jan Lukasiewicz back in 1924 in Warsaw Univeristy and thus became known as Polish Notation. Later on, it was discovered that computers could work an equation from front to back if the equation was written in postfix notation which is the reverse of Polish Notation. Thus, Reverse Polish Notation (or RPN) was born.
Early HP calculators used RPN notation which became the Geek Sheik thing back in the early 1970s. Imagine the sense of brain superiority you get when you hand your calculator to someone and they have no early idea how to use it. The only way to be cooler back then was to have a Curta.
Okay, enough walking down nostalgia lane. Let's get back to the problem...
The easiest way to construct an infix operation is to build a tree diagram of what you want. Thus, you should sketch out your LDAP query as a tree:
AND
/ | \
/ | \
/ | \
/ | \
/ | \
/ | \
/ | \
OR employee!=9* mail=*
/ \
/ \
/ \
/ \
/ \
phyDelOfficeName=100 phyDelOfficeName=274
To build a query based upon this tree, start with the bottom of the tree, and work your way up each layer. The bottom part of our tree is the OR part of our query:
(OR (physicalDeliveryOfficeName = 100) (physicalDeliveryOfficeName = 274))
Using LDAP's OR operator, the pipe (|) and removing the extra spaces, we get:
(|(physicalDeliveryOfficeName = 100)(physicalDeliveryOfficeName = 274))
When I build an LDAP query, I like to save each section as a Perl scalar variable. It makes it a bit easier to use:
$or_part = "|(physicalDeliveryOfficeName=100)(physicalDeliveryOfficeName=274)";
Notice I've left off the outer pair or parentheses. The outer set of parentheses return when you string all the queries back together. However, some people put them anyway. An extra set of parentheses doesn't hurt an LDAP query.
Now for the other two parts of the query:
$mailAddrExists = "mail=*";
$not_emp_starts_9 = "!(employee=9*)";
And, now we AND all three sections together:
"(&($mailAddrExists)($not_emp_starts_9)($or_part))"
Note that a single ampersand weaves it all together. I can substitute back each section to see the full query:
(&(mail=*)(!(employee=9*))(|(physicalDeliveryOfficeName=100)(physicalDeliveryOfficeName=274)))
Or like this:
my $psearch-$ldap->search(
base => $my_base,
attr => ["mail","employeeNumber","physicalDeliveryOfficeName"],
filter => "(&(mail=*)(!(employee=9*))(|(physicalDeliveryOfficeName=100)(physicalDeliveryOfficeName=274)))",
);
Or piecemeal:
my $mail = "mail=*";
my $employee = "!(employee=9*)";
my $physicalAddress = "|(physicalDeliveryOfficeName=100)"
. "(physicalDeliveryOfficeName=274)";
my $psearch-$ldap->search(
base => $my_base,
attr => ["mail","employeeNumber","physicalDeliveryOfficeName"],
filter => "(&($mail)($employee)($physicalAddress))",
);
As I said before, I can't test this. I hope it works. If nothing else, I hope you understand how to create an LDAP query and can figure out how to do it yourself.

Translate VBA syntax to Matlab for Activex control of Word document

I am a newbie to using activex controls in matlab. Am trying to control a word document. I need help translating between VBA syntax and Matlab, I think. How would one code the following in matlab?
Sub macro()
With CaptionLabels("Table")
.NumberStyle = wdCaptionNumberStyleArabic
.IncludeChapterNumber = True
.ChapterStyleLevel = 1
.Separator = wdSeparatorHyphen
End With
Selection.InsertCaption Label:="Table", TitleAutoText:="", Title:="", _
Position:=wdCaptionPositionAbove, ExcludeLabel:=0
End Sub
Thanks, I looked at the help and the source but I am still feeling dense. I want to be able to control caption numbering and caption text in an automated report. Am using Tables and figures. I just can't quite get my head around how to code the addition of the captions.
The following code gets me part way there. But I don't have control over numbering style, etc,. I have tried to figure out the activex structure but I can't make sense of it. In particular, In particular the first bit the VB subroutine above.
% Start an ActiveX session with Word
hdlActiveX = actxserver('Word.Application');
hdlActiveX.Visible = true;
hdlWordDoc = invoke(hdlActiveX.Documents, 'Add');
hdlActiveX.Selection.InsertCaption('Table',captiontext);
After some fiddling, I think I got it to work:
%# open Word
Word = actxserver('Word.Application');
Word.Visible = true;
%# create new document
doc = Word.Documents.Add;
%# set caption style for tables
t = Word.CaptionLabels.Item(2); %# 1:Figure, 2:Table, 3:Equation
t.NumberStyle = 0; %# wdCaptionNumberStyleArabic
t.IncludeChapterNumber = false;
t.ChapterStyleLevel = 1;
t.Separator = 0; %# wdSeparatorHyphen
%# insert table caption for current selection
Word.Selection.InsertCaption('Table', '', '', 0, false) %# wdCaptionPositionAbove
%# save document, then close
doc.SaveAs2( fullfile(pwd,'file.docx') )
doc.Close(false)
%# quit and cleanup
Word.Quit
Word.delete
Refer to the MSDN documentation to learn how to use this API. For example, the order of arguments of the InsertCaption function used above.
Note that I had to set IncludeChapterNumber to false, otherwise Word was printing "Error! No text of specified style in document" inside the caption text...
Finally, to find out the integer values of the wd* enums, I am using the ILDASM tool to disassemble the Office Interop assemblies (as this solution suggested). Simply dump the whole thing to text file, and search for the strings you are looking for.
Have a look at the help for actxserver and the source code for xlsread.m in the base MATLAB toolbox. If you're still stuck, then update your question with your progress.
EDIT:
You'll need to check the VBA help, but the first part ought to be possible via something like:
o = hdlWordDoc.CaptionLabels('Table');
o.NumberStyle = <some number corresponding to wdCaptionNumberStyleArabic>;
o.IncludeChapterNumber = true;
o.ChapterStyleLevel = 1;
o.Separator = <some number corresponding to wdSeparatorHyphen>;
In my experience, you have to get the values from the enumerations, such as wdCaptionNumberStyleArabic and wdSeparatorHyphen from a VBA script then hard-code them. You can try the following, but I don't think it works:
o.NumberStyle = 'wdCaptionNumberStyleArabic';
o.Separator = 'wdSeparatorHyphen';
Instead of hard-coding the text values into the code, you can use the enum constants. This will help when a different language of Word is installed.
A list of the Enums can be found here: https://learn.microsoft.com/en-us/office/vba/api/word(enumerations)
So instead of:
Word.Selection.InsertCaption('Table', '', '', 0, false) %# wdCaptionPositionAbove
you can use this:
NET.addAssembly('Microsoft.Office.Interop.Word')
Word.Selection.InsertCaption(...
Microsoft.Office.Interop.Word.WdCaptionLabelID.wdCaptionTable.GetHashCode,...
' My custom table caption text', '', ...
Microsoft.Office.Interop.Word.WdCaptionPosition.wdCaptionPositionAbove.GetHashCode, false)