Is there a way to create wordforms for Sphinx from Aspell?
I've never used Aspell, but it looks easy to dump a dictionary:
http://aspell.net/0.50-doc/man-html/5_Working.html
From there it should be trivial to convert that file to a wordforms file.
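For example, a rough sketch, assuming an English Aspell dictionary and that the base form appears first on each expanded line (mapping expanded forms back to the base word this way is only an approximation of real morphology):

# dump the base words, expand their affix flags, then emit each expanded
# form in Sphinx wordforms syntax ("form > base")
aspell -l en dump master | aspell -l en expand | \
  awk '{ for (i = 2; i <= NF; i++) print $i " > " $1 }' > wordforms.txt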
I somehow had a script running on my company's server that basically did a mongodump and then, for some reason, used recode to encode all the .bson files to UTF-8. Thanks to that, I can't use mongorestore, as it reports that every single .bson file is 268 MB.
Is there anything one can do to get data back from a binary BSON file that has been recoded to UTF-8? There's apparently no way to recode it back. Thanks.
OK. This probably only applies to MongoDB, but I'll post it as an answer because it may help people with this exact problem:
BSON files, while binary, are somewhat readable, depending on your need. In my case, I had a product collection, and most of what I had to update was descriptions and such.
While not a perfect solution, it is possible to just use Notepad++ to turn hex characters into new lines or anything else, and try to parse the resulting file, if you know what you are doing.
Since all fields (name, _id, description) are still there, I recommend turning those into XML headers, for example.
That solved my problem. Thanks.
I want to create a 'Did You Mean' feature on my website. For that, I need to build an index table with trigrams of all possible words. Hence I need to have all these words.
I know that's possible using the indexer myindex --buildstops dict.txt 100000 --buildfreqs command, but I don't want to export the results to a dict.txt file and then import it with my PHP script. I want to get all the words immediately and directly.
Is there any way to do this?
If the index is in dict=keywords format, you can use the indextool --dumpdict command as an alternative; it writes to STDOUT, so you can use it with popen etc. in PHP.
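A rough sketch of collecting the words straight from PHP (the config path and index name are placeholders, and the exact dump format can vary between Sphinx versions, so treat the parsing as an approximation):

$fh = popen('indextool --config /etc/sphinx/sphinx.conf --dumpdict myindex', 'r');
$words = array();
while (($line = fgets($fh)) !== false) {
    // dictionary lines look roughly like "keyword,docs,hits"; skip headers and blanks
    $parts = explode(',', trim($line));
    if (count($parts) >= 3 && ctype_digit($parts[1])) {
        $words[] = $parts[0];
    }
}
pclose($fh);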
I created a thesaurus for full text search a few months back. I just recently added some entries, and (I think) I updated it like this:
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
WITH [my_thesaurus], english_stem;
However, I don't actually remember what my thesaurus was called. How can I figure this out?
You may find it in the output of:
SELECT dictname FROM pg_catalog.pg_ts_dict;
If you use the psql client, you can use the following command:
\dFd[+] PATTERN
lists text search dictionaries
Basically, you can use \dFd+ to list all dictionaries along with their initialization options.
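If there are many dictionaries and you only want the thesaurus ones, a query along these lines narrows it down (thesaurus dictionaries are built on the thesaurus template):

-- list only dictionaries built on the thesaurus template
SELECT d.dictname
FROM pg_catalog.pg_ts_dict d
JOIN pg_catalog.pg_ts_template t ON t.oid = d.dicttemplate
WHERE t.tmplname = 'thesaurus';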
I have some documents that contain sequences such as radio/tested, and I would like them to return hits in queries like
select * from doc
where to_tsvector('english',body) @@ to_tsquery('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both the file token and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replace-ing them myself), that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.
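Once the modified parser is compiled and installed (following the contrib/test_parser example), registering it would look roughly like this; the parser and function names below are hypothetical:

-- the four support functions come from the compiled module
CREATE TEXT SEARCH PARSER nofile_parser (
    START    = nofile_start,
    GETTOKEN = nofile_gettoken,
    END      = nofile_end,
    LEXTYPES = nofile_lextypes
);

-- build a configuration on top of it, then add the dictionary mappings as usual
CREATE TEXT SEARCH CONFIGURATION public.english_nofile (PARSER = nofile_parser);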
I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string.
To date I have been using this:
my @search_results = `grep -i -l \"$string\" *.pdf`;
where $string is the text to look for.
However, this fails for most PDFs because the file format is obviously not ASCII.
What can I do that's easiest?
Clarification:
There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nicely with each other given that I don't know the names of the PDFs; I can't find the right syntax yet.
Final solution using Adam Bellaire's suggestion below:
@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;
The PerlMonks thread here talks about this problem.
It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:
my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
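For a whole directory of PDFs whose names aren't known in advance, a small Perl loop avoids the shell quoting headaches; a sketch, assuming pdftotext is on the PATH:

my @search_results;
for my $pdf (glob '*.pdf') {
    # "-" tells pdftotext to write the extracted text to STDOUT
    my $text = `pdftotext "$pdf" - 2>/dev/null`;
    push @search_results, $pdf if $text =~ /\Q$string\E/i;
}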
My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:
use CAM::PDF;

my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
    # extract whatever text the page contains and print it
    my $text = $doc->getPageText($pagenum);
    print $text;
}
I second Adam Bellaire's solution. I used the pdftotext utility to create a full-text index of my ebook library. It's somewhat slow, but it does its job. As for the full-text index itself, try Plucene or KinoSearch to store it.
You may want to look at PDF::Core.
The easiest full-text index/search I've used is MySQL. You just insert into the table with the appropriate index on it. You need to spend some time working out the relative weightings for fields (a match in the title might score higher than a match in the body), but this is all possible, albeit with some hairy SQL.
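A minimal sketch of what that looks like (table and column names are made up; MyISAM is assumed here, though InnoDB also supports FULLTEXT from MySQL 5.6 on):

CREATE TABLE docs (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    body  TEXT,
    FULLTEXT KEY ft_title (title),
    FULLTEXT KEY ft_body  (body)
) ENGINE=MyISAM;

-- weight title matches twice as heavily as body matches
SELECT id, title,
       2 * MATCH(title) AGAINST ('some search terms')
         + MATCH(body)  AGAINST ('some search terms') AS score
  FROM docs
 WHERE MATCH(title) AGAINST ('some search terms')
    OR MATCH(body)  AGAINST ('some search terms')
 ORDER BY score DESC;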
Plucene is deprecated (there hasn't been any active work on it in the last two years afaik) in favour of KinoSearch. KinoSearch grew, in part, out of understanding the architectural limitations of Plucene.
If you have ~300 PDFs, then once you've extracted the text from them (assuming the PDFs have text and not just images of text ;) and depending on your query volumes, you may find grep is sufficient.
However, I'd strongly suggest the MySQL/KinoSearch route, as they have covered a lot of ground (stemming, stopwords, term weighting, token parsing) that you don't want to get bogged down in yourself.
KinoSearch is probably faster than the MySQL route, but the MySQL route gives you more widely used standard software, tools, and developer experience. And you get the ability to use the power of SQL to augment your free-text search queries.
So unless you're talking HUGE data sets and insane query volumes, my money would be on MySQL.
You could try Lucene (the Perl port is called Plucene). The searches are incredibly fast and I know that PDFBox already knows how to index PDF files with Lucene. PDFBox is Java, but chances are there is something very similar somewhere in CPAN. Even if you can't find something that already adds PDF files to a Lucene index it shouldn't be more than a few lines of code to do it yourself. Lucene will give you quite a few more searching options than simply looking for a string in a file.
There's also a very quick and dirty way. Text in a PDF file is often stored as plain text (as long as the content streams aren't compressed). If you open a PDF in a text editor or use 'strings' you can see the text in there. The binary junk is usually embedded fonts, images, etc.
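For example (a quick sketch that only helps when the text streams in the PDFs aren't compressed):

# print the names of the PDFs whose visible text streams contain the string
for f in *.pdf; do strings "$f" | grep -qi "$string" && echo "$f"; done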