I want to create a 'Did You Mean' feature on my website. For that, I need to build an index table with trigrams of all possible words. Hence I need to have all these words.
I know that's possible using the indexer myindex --buildstops dict.txt 100000 --buildfreqs command, but I don't want to export the results to a dict.txt file and then import it with my PHP script. I want to get all the words immediately and directly.
Is there any way to do this?
If the index uses the dict=keywords format, you can use the indextool --dumpdict command as an alternative - it writes to STDOUT (so you can use it with popen etc. in PHP).
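For example, a minimal PHP sketch (the index name myindex is an assumption, you may also need to pass --config, and the exact per-line layout of the dumpdict output depends on your Sphinx version, so adjust the parsing):

$handle = popen('indextool --dumpdict myindex', 'r');
if ($handle === false) {
    die("could not run indextool\n");
}
$words = array();
while (($line = fgets($handle)) !== false) {
    // assume the keyword is the first comma- or whitespace-separated field
    $fields = preg_split('/[,\s]+/', trim($line));
    if ($fields[0] !== '') {
        $words[] = $fields[0];   // build trigrams from this word
    }
}
pclose($handle);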
I am trying to query a Sybase ASA database using the dbisqlc.exe command-line utility on a Windows system, and I would like to collect the column headers along with the associated table data.
Example:
dbisqlc.exe -nogui -c "ENG=myDB;DBN=dbName;UID=dba;PWD=mypwd;CommLinks=tcpip{PORT=12345}" select * from myTable; OUTPUT TO C:\OutputFile.txt
I would prefer it if this command wrote to stdout however that does not appear to be an option aside from using dbisql.exe which is not available in the environment I am in.
When I run it in this format, the header and data are generated, but in a format that is difficult to parse.
Any assistance would be greatly appreciated.
Try adding the 'FORMAT SQL' clause to the OUTPUT statement. It will give you the select statement containing the column names as well as the data.
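For example, based on the command above, the statement would become (a sketch; same connection string as before):

dbisqlc.exe -nogui -c "ENG=myDB;DBN=dbName;UID=dba;PWD=mypwd;CommLinks=tcpip{PORT=12345}" select * from myTable; OUTPUT TO C:\OutputFile.txt FORMAT SQL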
In reviewing the output from the following dbisqlc.exe command, it appears as though I can parse the output using perl.
Command:
dbisqlc.exe -nogui -c "ENG=myDB;DBN=dbName;UID=dba;PWD=mypwd;CommLinks=tcpip{PORT=12345}" select * from myTable; OUTPUT TO C:\OutputFile.txt
The output appears to break in odd places when viewed in text editors such as vi or TextPad; however, the output from this command is actually returned with fixed column widths.
The second line of the output is a row of = signs spanning the width of each column. What I did was build a "template" string based on those = runs, which can be passed to Perl's unpack function. I then use this template to build an array of column names and parse the result set using unpack.
This may not be the most efficient method however I think it should give me the results I am looking for.
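A rough sketch of that approach (the output file name is an assumption, and it assumes columns are separated by single spaces):

use strict;
use warnings;

open my $fh, '<', 'C:\OutputFile.txt' or die "cannot open output file: $!";
my $header_line    = <$fh>;   # column names
my $separator_line = <$fh>;   # runs of = signs, one run per column
chomp($header_line, $separator_line);

# Build a template such as "A10 x A25 x A8" from the widths of the = runs.
my @widths   = map { length } split /\s+/, $separator_line;
my $template = join ' x ', map { "A$_" } @widths;

my @columns = unpack $template, $header_line;
while (my $row = <$fh>) {
    chomp $row;
    my @fields = unpack $template, $row;
    # ... work with @columns and @fields here ...
}
close $fh;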
I'm in the planning stages of a script/app that I'm going to need to write soon. In short, I'm going to have a configuration file that stores multiple key value pairs for a system configuration. Various applications will talk to this file including python/shell/rc scripts.
One example case would be that when the system boots, it pulls the static IP to assign to itself from that file. This means it would be nice to quickly grab a key/value from this file in a shell/rc script (ifconfig `evalconffile main_interface` `evalconffile primary_ip` up), where evalconffile is the script that fetches the value when provided with a key.
I'm looking for suggestions on the best way to approach this. I've tossed around the idea of using a plain text file and Perl to retrieve the value. I've also tossed around the idea of using YAML for the configuration file, since there may end up being a use case where we need multiple values for a key, and for general expansion. I know YAML would make it accessible from Python and Perl, but I'm not sure what the best way to quickly access it from a shell/rc script would be.
Am I headed in the right direction?
One approach would be to simply do the YAML as you wanted, and then when a shell/rc script wants a key/value pair, it would call a small Perl script (the evalconffile in your example) that parses the YAML on the shell script's behalf and prints out the value(s).
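A minimal sketch of such a script, assuming YAML::Tiny is installed, a flat key/value config at /etc/sysconf.yaml, and that each key maps to a single value:

#!/usr/bin/perl
use strict;
use warnings;
use YAML::Tiny;

my $key  = shift or die "usage: evalconffile KEY\n";
my $conf = YAML::Tiny->read('/etc/sysconf.yaml')->[0];   # first YAML document as a hashref
die "unknown key: $key\n" unless exists $conf->{$key};
print $conf->{$key}, "\n";

A shell/rc script can then use it exactly as in your example, e.g. ifconfig `evalconffile main_interface` `evalconffile primary_ip` up.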
SQLite would give you the greatest flexibility, since you don't seem to know the scope of what will be stored in there. It appears there's support for it in all the scripting languages you mentioned.
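In that case a shell/rc script could grab a value with the sqlite3 command-line tool (a sketch; the database path and table/column names are assumptions):

primary_ip=`sqlite3 /etc/sysconf.db "SELECT value FROM config WHERE key = 'primary_ip';"`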
In my program I have a list of pairs - name and size.
I want to build this list from the command line interface using boost::program_options.
It should look something like this:
myProg --value("John",10) --value("Steve",14) --value("Marge",28)
I also need this to be in order - Steve will be after John and before Marge on the list. Is that possible with boost::program_options?
This CLI syntax is just an idea to get the list. If you have a better one, do tell.
You just define your option
("value", value<vector<YourPairType>>()->composing(), "description")
and an appropriate
istream& operator >> (istream& in, YourPairType& pr) { /* ... */ }
that reads a single YourPairType from the input stream in your ("John",10) format. Parsed options will be stored in the order they appear on the command line.
You can achieve greater flexibility if you use custom validators instead of operator >>.
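A rough sketch of what that could look like (the type name NameSize and the parsing details are assumptions, not part of the original answer):

#include <iostream>
#include <istream>
#include <string>
#include <vector>
#include <boost/program_options.hpp>

struct NameSize {
    std::string name;
    int size;
};

// Reads one pair in the ("John",10) format.
std::istream& operator>>(std::istream& in, NameSize& pr) {
    char c;
    in >> c;                        // '('
    in >> c;                        // opening '"'
    std::getline(in, pr.name, '"'); // name up to the closing '"'
    in >> c;                        // ','
    in >> pr.size;
    in >> c;                        // ')'
    return in;
}

int main(int argc, char* argv[]) {
    namespace po = boost::program_options;
    po::options_description desc("options");
    desc.add_options()
        ("value", po::value<std::vector<NameSize>>()->composing(), "name/size pair");

    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);

    if (vm.count("value")) {
        // Pairs come out in the order they were given on the command line.
        for (const NameSize& p : vm["value"].as<std::vector<NameSize>>())
            std::cout << p.name << " " << p.size << "\n";
    }
    return 0;
}

Note that the shell will usually need the arguments quoted because of the parentheses, e.g. myProg --value='("John",10)' --value='("Steve",14)'.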
A file with each line holding one pair of values could be another option. The file could be a plain ASCII text file, or you could go for XML files too - refer to Boost.Serialization.
I have some documents that contain sequences such as radio/tested, and I would like them to return hits in queries like
select * from doc
where to_tsvector('english', body) @@ to_tsquery('english', 'radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, I can see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling the file token, or at least having it recognize both the file token and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of running regexp_replace on them myself), that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.
I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string.
To date I have been using this:
my @search_results = `grep -i -l \"$string\" *.pdf`;
where $string is the text to look for.
However, this fails for most PDFs because the file format is obviously not plain ASCII.
What can I do that's easiest?
Clarification:
There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nice with each other, but given I don't know the names of the PDFs, I can't find the right syntax yet.
Final solution using Adam Bellaire's suggestion below:
@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;
The PerlMonks thread here talks about this problem.
It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:
my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:
my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
    my $text = $doc->getPageText($pagenum);
    print $text;
}
I second Adam Bellaire's solution. I used the pdftotext utility to create a full-text index of my ebook library. It's somewhat slow, but it does the job. As for full-text search, try Plucene or KinoSearch to store the full-text index.
You may want to look at PDF::Core.
The easiest full-text index/search I've used is MySQL. You just insert into a table with the appropriate index on it. You need to spend some time working out the relative weightings for fields (a match in the title might score higher than a match in the body), but this is all possible, albeit with some hairy SQL.
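A rough sketch of that setup (table and column names are assumptions; MyISAM is used since it supports FULLTEXT indexes):

CREATE TABLE docs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    body TEXT,
    FULLTEXT KEY ft_title (title),
    FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- Weight a title match twice as heavily as a body match.
SELECT id,
       2 * MATCH(title) AGAINST('some phrase')
         + MATCH(body) AGAINST('some phrase') AS score
FROM docs
WHERE MATCH(title) AGAINST('some phrase')
   OR MATCH(body) AGAINST('some phrase')
ORDER BY score DESC;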
Plucene is deprecated (there hasn't been any active work on it in the last two years afaik) in favour of KinoSearch. KinoSearch grew, in part, out of understanding the architectural limitations of Plucene.
If you have ~300 PDFs, then once you've extracted the text from them (assuming the PDFs contain text and not just images of text ;) and depending on your query volumes, you may find grep is sufficient.
However, I'd strongly suggest the MySQL/KinoSearch route, as they have already covered a lot of ground (stemming, stopwords, term weighting, token parsing) that you don't benefit from getting bogged down in.
KinoSearch is probably faster than the MySQL route, but the MySQL route gives you more widely used standard software/tools/developer experience. And you get the ability to use the power of SQL to augment your free-text search queries.
So unless you're talking HUGE data sets and insane query volumes, my money would be on MySQL.
You could try Lucene (the Perl port is called Plucene). The searches are incredibly fast and I know that PDFBox already knows how to index PDF files with Lucene. PDFBox is Java, but chances are there is something very similar somewhere in CPAN. Even if you can't find something that already adds PDF files to a Lucene index it shouldn't be more than a few lines of code to do it yourself. Lucene will give you quite a few more searching options than simply looking for a string in a file.
There's also a very quick and dirty way. Text in a PDF file is actually stored as plain text. If you open a PDF in a text editor or use 'strings' you can see the text in there. The binary junk is usually embedded fonts, images, etc.