Should I use Parse::RecDescent or Regexp::Grammars to extract tables from documents? - perl

I have lots of large plain-text documents I wish to parse with Perl. Each document contains mostly English paragraphs, plus a couple of plain-text marked-up tables.
I have created a grammar to describe the table structure, but am unsure whether it would be best to use Parse::RecDescent or Regexp::Grammars to extract the tables.
I initially leaned towards Parse::RecDescent, but I'm not sure how a grammar would deal with the 90% of the document text I want to ignore in order to find the couple of tables buried inside each document.
Perhaps I need Regexp::Grammars so I can "pull" my expression through the document until it finds matches?
Thanks

Regexp::Grammars is what I wanted, as it allows you to pull your grammar through the document and find matches like a regular expression. Parse::RecDescent doesn't seem suited to scanning through a document and finding only the text that matches the grammar.
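Not Perl, but to make the distinction concrete: the behaviour being described is essentially "keep pulling the pattern through the text and collect whatever matches, skipping the prose in between". A rough sketch of that idea with an ordinary regex in Java (the pipe-delimited table format is made up for illustration):

    import java.util.regex.*;

    public class TableScan {
        public static void main(String[] args) {
            String doc = "Some English paragraph...\n"
                       + "| name  | qty |\n"
                       + "| apple | 3   |\n"
                       + "More prose follows, then another table:\n"
                       + "| pear  | 7   |\n";

            // A made-up "table row" pattern: lines made of pipe-delimited cells.
            Pattern row = Pattern.compile("^\\|.*\\|\\s*$", Pattern.MULTILINE);

            // find() keeps pulling the pattern through the text, ignoring everything else.
            Matcher m = row.matcher(doc);
            while (m.find()) {
                System.out.println("table row: " + m.group());
            }
        }
    }

Regexp::Grammars gives you that same scanning behaviour while still letting the match be driven by a full grammar rather than a single pattern.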

Related

Filter unnecessary words when doing full text search in PostgreSQL

I've created full text search in PostgreSQL based on this wonderful article.
It works well enough, but one thing should be fixed.
Say I have a blog post in my DB with the text:
"All kittens go to heaven"
If a user searches for "All kittens go to heaven, may be..." the DB will return nothing, because the words "may be" are not found.
I can post my SQL query, but it's pretty much the same as described in the article.
Is there a way to show matching articles which contain most of the searched words?
This is a fundamental problem with PostgreSQL's Text Search.
You could try to pre-parse the query, and strip out any terms that aren't in the "corpus" of terms of all your documents, but that doesn't really solve your problem.
You could try changing your query to 'or' all the terms, but this could have performance problems.
The best bet would be to try the smlar extension (written by the Text Search authors), which can use cosine/tfidf weighting. This means that the query can have terms that aren't in the document and still match.
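If you do try the 'or' approach, the rewrite can happen before the query reaches PostgreSQL by joining the user's words with | for to_tsquery. A minimal Java/JDBC sketch (the posts table and its body_tsv tsvector column are made-up names):

    import java.sql.*;

    public class OrTermsSearch {
        public static void main(String[] args) throws Exception {
            String userQuery = "All kittens go to heaven, may be";
            // Join the words with '|' so any term may match instead of all of them.
            // Stop words such as "to" and "be" are dropped by the 'english' config anyway.
            String orQuery = String.join(" | ", userQuery.trim().split("\\W+"));

            String sql = "SELECT id, title, ts_rank(body_tsv, query) AS rank "
                       + "FROM posts, to_tsquery('english', ?) query "
                       + "WHERE body_tsv @@ query "
                       + "ORDER BY rank DESC";

            try (Connection conn = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/blog", "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, orQuery);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("title") + "  " + rs.getDouble("rank"));
                    }
                }
            }
        }
    }

Ranking by ts_rank at least pushes the articles that contain most of the searched words towards the top, which is closer to what the question asks for.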

What is the aliases array in fs.files for?

I want to learn more about MongoDB's GridFS so I had a look at the manual.
It says:
files.aliases
Optional. An array of alias strings.
I know that this field may contain an array of strings, but what are the values inside this array used for? Alternative filenames?
Yes. For instance, from the MongoDB csharp driver source code (MongoGridFSFileInfo.cs, ln 474):
// copy all createOptions except Aliases (which are considered alternate filenames)
However, it's rather unclear what the semantics of this field are. The csharp driver, for instance, won't look for aliases when you search by name. As far as I can see, there's not even an index on aliases, so searching on that field is practically impossible.
In general, keep in mind that GridFS is a mere concept of how to store large files in MongoDB - the implementation isn't special in any way - it's just regular collections and conventions, plus the command line tools. While the general idea of GridFS is neat, it does come with a lot of assumptions and conventions that you might not want to deal with, and that can be painful to work with in statically typed languages. You can always build your own GridFS with a different fieldset, though.
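Since fs.files is an ordinary collection, nothing stops you from querying (and indexing) the aliases field yourself. A sketch with the legacy MongoDB Java driver; the alias value and database name are made up:

    import com.mongodb.*;

    public class GridFsAliasLookup {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost");
            // GridFS metadata lives in a plain collection, usually fs.files.
            DBCollection files = client.getDB("test").getCollection("fs.files");

            // Find files whose aliases array contains the alternate name.
            DBObject query = new BasicDBObject("aliases", "report-2012.txt");
            for (DBObject doc : files.find(query)) {
                System.out.println(doc.get("filename") + " " + doc.get("length"));
            }

            // Without an index this is a collection scan; add one if you rely on alias lookups.
            files.createIndex(new BasicDBObject("aliases", 1));

            client.close();
        }
    }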

Segmenting words, and grouping hyphenated and apostrophe words from text

I need to segment words from a text. Sometimes hyphenated words are written without the hyphen, and apostrophised words are written without the apostrophe. There are also similar issues, such as different spellings of the same word (e.g. color, colour) or single words written with a space in them (e.g. upto, up to; blankspace, blank space). I need to group these variants under one single representation and insert them into a set/hashmap or some other store. There can also be problems with accented words written without the accents (although I haven't faced them yet). Currently I am cutting the words at any blank-space character and every non-alphanumeric character, then stemming them and omitting stop words.
These indexes would later be used for document similarity checking and searching, etc. Any suggestions on how I can combat these problems? I have thought of matching each scanned word against a word list, but the problem is that proper nouns and non-dictionary words would be omitted.
Info: My code is in Java
I think you should apply a combination of techniques.
1) For common spelling variants I would go with a dictionary-based method. Since they are common, I wouldn't worry about missing non-dictionary words. That should solve the color/colour problem.
2) For typos and other non-standard spelling variants you can apply the Metaphone (http://en.wikipedia.org/wiki/Metaphone) algorithm to convert the tokens to a representation of their English pronunciation. Similar variants sound similar, so you can match them to each other (e.g., Jon to John). You can also use edit-distance-based matching at query time to match very similar tokens that differ only by a pair of transposed characters or a dropped character (e.g., Huseyin versus Housein). A combined sketch of points 1) to 3) follows this list.
3) For apostrophes and compound words with a hyphen in between, you can store both variants. For example, "John's" would be indexed both as "John s" and "Johns". "blank-space" can be converted to (or stored along with) "blank space" and "blankspace".
4) For compound words without any hyphen in between, you could use an external library such as HyphenationCompoundWordTokenFilterFactory class of Solr (http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html). Although it can use a dictionary, it doesn't have to. It is targeted to deal with compound words that are frequently encountered in German and similar languages. I see no reason why you can't apply it to English (you'll need to supply an English dictionary and hyphenation rule files).
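A rough Java sketch combining points 1) to 3), using the Metaphone implementation from Apache Commons Codec; the small canonical-spelling map is invented for illustration:

    import org.apache.commons.codec.language.Metaphone;
    import java.util.*;

    public class VariantGrouper {
        // Tiny hand-made dictionary of common spelling variants (illustrative only).
        private static final Map<String, String> CANONICAL = Map.of(
                "colour", "color",
                "up to", "upto",
                "blank space", "blankspace");

        private final Metaphone metaphone = new Metaphone();

        // Map a token to the key under which all of its variants are grouped.
        public String groupKey(String token) {
            String t = token.toLowerCase(Locale.ROOT)
                            .replace("-", " ")   // blank-space -> blank space
                            .replace("'", "");   // John's -> Johns
            t = CANONICAL.getOrDefault(t, t);    // colour -> color
            // The phonetic key groups typos and transliterations: "Jon" and "John" both become "JN".
            return metaphone.metaphone(t);
        }

        public static void main(String[] args) {
            VariantGrouper g = new VariantGrouper();
            for (String w : List.of("colour", "color", "Jon", "John", "blank-space", "blankspace")) {
                System.out.println(w + " -> " + g.groupKey(w));
            }
        }
    }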
Actually, the last point raises an important question. I don't think you intend to build your own search library from scratch. If that's true, why don't you use Lucene (or Solr, which is based on Lucene), a Java-based search library which already has methods and ways to deal with these problems? For example, the injection technique allows you to index both color and colour in the same place in a document; thus it doesn't matter whether you search for "colored cars" or "coloured cars" (assuming you take care of stemming). There are filters which do phonetic indexing (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html). There is even a FuzzyQuery component which lets you allow a certain amount of edit distance when matching similar terms (http://lucene.apache.org/core/old_versioned_docs/versions/3_2_0/api/all/org/apache/lucene/search/FuzzyQuery.html).
You will also need to decide at which point you want to deal with these problems: one extreme is to index all possible variants of these terms during indexing and use the queries as they are. That will keep your query processing light, but will cost you a larger index (because of all the variants you need to store). The other extreme is to index the documents as they are and expand the queries during the search. That will allow you to keep your index lean at the cost of heavier query processing. Phonetic indexing requires you to process both your documents during indexing and the queries during the search. Fuzzy matching is feasible only at search time because, presumably, you wouldn't be able to store all edit variants of all terms in the index.
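To make the trade-off concrete, this is roughly what query-time expansion looks like with Lucene (a sketch against a recent Lucene version; the constructor signatures differ a little in the 3.x release linked above, and the field name and variants are illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class QueryTimeExpansion {
        // Expand the user's term into known variants at query time, leaving the index lean.
        public static Query expand(String field, String... variants) {
            BooleanQuery.Builder builder = new BooleanQuery.Builder();
            for (String v : variants) {
                builder.add(new TermQuery(new Term(field, v)), BooleanClause.Occur.SHOULD);
            }
            return builder.build();
        }

        public static void main(String[] args) {
            // Matches documents containing either spelling.
            Query spelling = expand("body", "color", "colour");

            // FuzzyQuery tolerates a small edit distance, so "housein" (two edits away)
            // can still match an indexed "huseyin".
            Query fuzzy = new FuzzyQuery(new Term("body", "housein"), 2);

            System.out.println(spelling);
            System.out.println(fuzzy);
        }
    }

The index-time alternative is the mirror image: you add the extra variants (or phonetic keys) as additional tokens while indexing and leave the queries untouched.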

Please advise an optimal solution to full text search in mongoDB

The documents in my database have names and descriptions among other fields. I would like to allow the users to search for those documents by providing some keywords. The keywords should be used to lookup in both the name and the description field. I've read the mongoDB documentation on full text search and it looks really nice and easy if I want to search for keywords in the name field of my documents. However, the description field contains free form text and can take up to 2000 characters, so potentially there are a few hundred words per document. I could treat them the same way as names and just split the whole description into separate words and store it as another tag-like array (as per the Mongo example), but it seems like a terrible idea - each document's size could be almost doubled, plus there are characters like dots, commas, etc.
I know there are specialized solutions for exactly this kind of problem, and I was just looking at Lucene.Net; I also saw Solr mentioned here and there.
Should I be looking to implement this search feature in mongoDB or should I use a specialized solution? Currently I just have one instance of mongod and one instance of a web server. We might need to scale later, but for now that is all I use. I'd appreciate any suggestions on how to implement this feature.
If storing the text split out into an array per the documented approach is not viable (I can understand your concerns), then I think you should look into a specialised solution.
Quote from the MongoDB documentation:
"MongoDB has interesting functionality that makes certain search functions easy. That said, it is not a dedicated full text search engine."
So, for more advanced full text search functionality I think a dedicated engine would be more suited. I have no experience in this area so I can't offer much in the way of suggestions from here, other than what my thoughts would be if I was in the same boat:
how much work is involved in using a dedicated full-text search engine instead of MongoDB's functionality?
does that add more complexity / is it worth it?
would it be quicker/simpler to use MongoDB and just take the hit on the extra disk space? (a sketch of that approach follows this list)
maybe MongoDB will support better full-text functionality in future (it is rapidly evolving after all)
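For reference, the "take the hit on the extra disk space" option sketched with the legacy MongoDB Java driver (collection, database, and field names are made up); the keywords array is what makes the documents larger:

    import com.mongodb.*;
    import java.util.*;

    public class KeywordArrayIndexing {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost");
            DBCollection docs = client.getDB("test").getCollection("documents");

            String description = "The quick brown fox jumps over the lazy dog.";
            // Crude tokenisation: lowercase, strip punctuation, drop duplicates.
            Set<String> keywords = new HashSet<>(
                    Arrays.asList(description.toLowerCase().split("[^a-z0-9]+")));
            keywords.remove("");

            docs.insert(new BasicDBObject("name", "Example")
                    .append("description", description)
                    .append("keywords", new ArrayList<>(keywords)));

            // A multikey index makes lookups on individual keywords cheap.
            docs.createIndex(new BasicDBObject("keywords", 1));

            // A query for one keyword matches any document whose array contains it.
            System.out.println(docs.findOne(new BasicDBObject("keywords", "fox")));

            client.close();
        }
    }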
Full-text search support is planned for the future. However, right now you have to go with Solr & friends. The built-in "fulltext" functionality is not really suitable for real-world usage.

sqlite Indexing Performance Advice

I have an sqlite database in my iPhone app that I access via the Core Data framework. I'm using NSPredicates to query the database.
I am building a search function that needs to search six different varchar fields that contain text. At the moment, it's very slow and I need to improve performance, probably in the sqlite database. Would it be best to create an index on all those columns? Or would it be better to build a custom index table that expands those six columns into multiple rows, each containing a word and the ID it matches? Any other suggestions?
There are things you can do to improve the performance of searching for text in SQLite databases. Although Core Data abstracts you away from the underlying store, it can be good to have an appreciation of what is going on when your store is backed by SQLite.
If we assume you're doing a substring search of these fields, there are things you can do to improve search performance. Apple recommend using derived properties. This amounts to maintaining a normalised version of your property in your model that is used for searching. The derived property should be stored in a form that can be indexed. You then express your search in terms of this derived property using binary operators such as > and <=.
I found doing this reduced our search from around 1 second to under 100ms.
To make things clear I would suggest looking at the ADC example http://developer.apple.com/mac/library/samplecode/DerivedProperty/
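Core Data does the bookkeeping for you, but the underlying idea is easy to see outside it: keep an indexed, normalised copy of the text and search that copy with plain comparisons. A rough sketch of the same idea in Java/JDBC against SQLite (table and column names are invented; it assumes the sqlite-jdbc driver is on the classpath):

    import java.sql.*;
    import java.text.Normalizer;
    import java.util.Locale;

    public class DerivedSearchColumn {
        // The "derived property": a lowercase, accent-stripped copy of the original text.
        static String normalize(String s) {
            String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
        }

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE person (name TEXT, name_normalized TEXT)");
                st.execute("CREATE INDEX idx_person_norm ON person(name_normalized)");

                try (PreparedStatement ins =
                         conn.prepareStatement("INSERT INTO person VALUES (?, ?)")) {
                    for (String name : new String[] {"Héloïse", "Heather", "Quincey"}) {
                        ins.setString(1, name);
                        ins.setString(2, normalize(name));
                        ins.executeUpdate();
                    }
                }

                // A prefix search expressed with >= / < so the index on the derived column is used.
                try (PreparedStatement q = conn.prepareStatement(
                         "SELECT name FROM person WHERE name_normalized >= ? AND name_normalized < ?")) {
                    q.setString(1, "he");
                    q.setString(2, "hf");   // the next prefix after "he"
                    try (ResultSet rs = q.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getString("name"));
                        }
                    }
                }
            }
        }
    }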
From the Core Data Programming Guide:
"How you use predicates can significantly affect the performance of your application. If a fetch request requires a compound predicate, you can make the fetch more efficient by ensuring that the most restrictive predicate is the first, especially if the predicate involves text matching (contains, endsWith, like, and matches) since correct Unicode searching is slow. If the predicate combines textual and non-textual comparisons, then it is likely to be more efficient to specify the non-textual predicates first, for example (salary > 5000000) AND (lastName LIKE 'Quincey') is better than (lastName LIKE 'Quincey') AND (salary > 5000000)."
If there is a way to reorder your query such that the simplest logic is on the left and the most complex on the right, that can help your search performance. As Lyon suggests, searching Unicode text is extremely expensive, so Apple recommends searching against derived values that strip Unicode characters and common words like "a", "and", and "the".
I assume these columns store text. The question is how much text there is and how often this model is accessed. If it is a large amount of text, I would create other properties that hold the text with common words and Unicode characters stripped out. The only downside is that you end up with extra properties to maintain. You can then add whatever indexing you need to improve performance on those columns.
If what you want is essentially full-text indexing of your sqlite db, then you may want to use SQLite's FTS3 module, since that's exactly what it provides:
http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
http://dotnetperls.com/sqlite-fts3
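The FTS3 interface is plain SQL, so here is a small sketch of its surface via JDBC (on the iPhone you would issue the same statements through the C API). It assumes an SQLite build with FTS3 compiled in, which the xerial sqlite-jdbc driver used below normally provides; the table and column names mirror the six searchable fields and are made up:

    import java.sql.*;

    public class Fts3Demo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:notes.db");
                 Statement st = conn.createStatement()) {

                // All six searchable text fields become columns of one virtual table.
                st.execute("CREATE VIRTUAL TABLE IF NOT EXISTS note_index USING fts3("
                         + "title, summary, body, author, tags, comments)");

                st.execute("INSERT INTO note_index (title, body) "
                         + "VALUES ('Search test', 'All six columns are full-text indexed')");

                // MATCH searches every column of the virtual table using the full-text index.
                try (ResultSet rs = st.executeQuery(
                        "SELECT title FROM note_index WHERE note_index MATCH 'indexed'")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("title"));
                    }
                }
            }
        }
    }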