Reliably getting a character count for .doc files - ms-word

What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, so counting characters is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?

Here's a link to some Linux word-to-text converters.
For example you could use
antiword file.doc | wc
to do the counting.
Edit:
This link shows that AbiWord has a command-line interface that you could use to convert the .docx format to .txt and then count the words using "wc". AbiWord does support the docx format.
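If installing a converter is not an option, a .docx file can also be read directly, since it is a zip archive whose body text lives in word/document.xml. What follows is only a rough sketch under that assumption, not a replacement for a real converter: the names count_docx and CountError are made up for illustration, headers, footers and footnotes are not counted, and text boxes may be handled incorrectly. It does not handle the older binary .doc format at all.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


class CountError(Exception):
    """Hypothetical error type: raised when no count could be produced."""


def count_docx(path):
    """Return (characters, words) for the main body text of a .docx file."""
    try:
        with zipfile.ZipFile(path) as archive:
            root = ET.fromstring(archive.read("word/document.xml"))
    except (OSError, KeyError, zipfile.BadZipFile, ET.ParseError) as exc:
        # Surface the failure so the caller knows the count is missing.
        raise CountError(f"cannot count {path!r}: {exc}") from exc

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:t element holds one run of visible text.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    text = "\n".join(paragraphs)

    characters = len(re.sub(r"\s", "", text))  # characters excluding whitespace
    words = len(text.split())                  # whitespace-separated tokens
    return characters, words
```

Because unreadable files raise CountError instead of returning zero, the "failed to count" case stays catchable, which is what the question asks for.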

Mac OS X has support for reading word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil), language:nil)
More portably, though it gives up support for docx, you could simply get Antiword and run antiword file.doc | wc -w.

Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
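To make the point about the definition of "word" concrete, here is a small sketch that counts the same text two ways. It says nothing about how Word itself counts; the function name and the digits_are_words flag are just for illustration.

```python
import re


def count_words(text, digits_are_words=True):
    """Count whitespace-separated tokens, optionally skipping tokens with no letters."""
    tokens = text.split()
    if digits_are_words:
        return len(tokens)
    # Keep only tokens that contain at least one (Unicode) letter.
    return sum(1 for tok in tokens if re.search(r"[^\W\d_]", tok))
```

For the text "Call 555 1234 now" the two definitions give 4 and 2, so whichever rule is chosen needs to be applied consistently.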

If you consider online applications as a solution, yes, there is a solution.
This not so pretty (regarding the design) site offers both word and character count: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 100% or 99% accuracy, you could test it with a few short samples (e.g. 20-50 words each) by counting them yourself first.
I hope this helps.
Regards. Chris

What are "dont" and "isnt" in the pretrained GloVe vector files (e.g. glove.6B.50d.txt)?

I found these 2 words "dont" and "isnt" in the vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/. I wonder if they were originally "don't" and "isn't". This will likely depend on the sentence_to_word parsing algorithms they used. If someone is familiar, please confirm if this is the case.
A secondary question is whether this is a common way to deal with apostrophes in words like "don't", "isn't", "hasn't" and so on, i.e. just replacing the apostrophe with an empty string so that "don" and "t" become one word.
Finally, I am also not sure whether GloVe comes with an API for sentence_to_word parsing, so that you can be consistent with what the researchers did originally.
I think dont and isnt really were originally don't and isn't. I have seen a few other such examples. I suspect this is just the specific way the GloVe researchers handled this.
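For illustration, this is the kind of preprocessing that would turn "don't" into "dont". It is only a guess at the general approach, not the actual GloVe preprocessing, and the tokenize function is made up for this example.

```python
import re


def tokenize(sentence):
    """Lowercase, delete apostrophes, then split on anything non-alphanumeric."""
    # Remove both the ASCII apostrophe and the typographic one (U+2019).
    cleaned = sentence.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z0-9]+", cleaned)
```

If you look words up in the vector file, applying the same normalization to your own text keeps "don't" and "dont" mapped to the same entry.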

Alternative to do complex operations in text files using notepadpp?

I'm looking for a practical way to make complex operations in textual files.
From time to time I have the need to develop a whole application (usually in C++ or C#.Net) just to tweak textual configuration files (as .ini, .xml, .txt, etc).
Today I have the need to modify a .txt file with a well-known pattern of setting values to variables. I need to change the value of one specific variable (that appears many times in the file) by multiplying it by a constant (I first thought of using notepadpp + regex backreference, but as I found in this thread: How to do a calculation using regex backreference in notepadpp? it seems to be impossible).
Just when I was about to start developing another heavy desktop tool to accomplish this trivial task, I wondered whether this is how everyone smarter than me actually does this kind of thing. I thought there could be a notepadpp plugin that allows complex operations on text using some kind of scripting language, but I couldn't find any.
Thanks in advance.
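For a one-off task like this, a few lines in a scripting language can do what a regex backreference alone cannot, because the replacement can be computed by a function. The sketch below assumes the file uses lines of the form "name = value", which is a guess since the question does not show the file; scale_variable is a made-up name.

```python
import re


def scale_variable(text, name, factor):
    """Multiply every numeric value assigned to `name` (as `name = value`) by `factor`."""
    pattern = re.compile(
        r"^(\s*" + re.escape(name) + r"\s*=\s*)(-?\d+(?:\.\d+)?)",
        re.MULTILINE,
    )

    def replace(match):
        prefix, value = match.group(1), match.group(2)
        if "." in value:
            scaled = repr(float(value) * factor)
        else:
            scaled = str(int(value) * factor)
        return prefix + scaled

    # re.sub accepts a function, so the arithmetic happens per match.
    return pattern.sub(replace, text)
```

To apply it, read the file into a string, call scale_variable(text, "speed", 3) with your own variable name and constant, and write the result back.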

What's a Good package for Phonetic Representation for Various Human Languages?

I'm currently working on a project for which I think being able to come up with phonetic representations of words in various languages would be really helpful. I know Aspell does this pretty well, but I don't think there's a very easy way to get at their phonetic representations, so I ask: is there some other good package for getting the phonetic representation of a word given the word and the language/dialect/accent/whatever it's coming from?
This doesn't need to be in any particular language, but if it were Perl, that would be best.
I've already tried Soundex, Metaphone, DoubleMetaphone, and everything else in Text::Phonetic, and none of that stuff was very good – definitely nowhere near as good as the stuff in Aspell.
The first thing that springs to mind is Soundex. Of course, there is a Perl module Soundex, too. While this is designed to generate a soundex "key" from input it might be useful in mapping different variants to a common key.
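To show what mapping variants to a common key looks like, here is a sketch of the commonly described American Soundex rules in Python. It is written from the textbook description and may differ in details from the Perl module.

```python
# Letter groups that share a Soundex digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex key of an ASCII word ("" if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]
```

With this, "Robert" and "Rupert" both map to R163. Note that it only looks at ASCII letters, which is one reason it does poorly outside English.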
There is a package Text::Aspell in CPAN. Might be useful.
If you are trying to make a Google-style suggestion/correction system, it's not based on just phonetics or AI, but on a massive amount of user input. When a user makes a search, and doesn't click on any link but corrects the input and searches again, it gives Google a lot more data about "correct" writing than a phonetics test or dictionary matching.
The main problem is in human language itself: people don't speak or write in a deterministic way, let alone in multiple languages.
Of course, I might be wrong, but if you need a library that lets you do this:
getLanguage(string);
I want to see that working, really.

Tool to compare/diff HTML in bulk

I have a lot of HTML files (tens of thousands, GBs worth) scraped from a server, and I want to check that the server produces the same results after some modifications, while ignoring kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of numbers, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh, and it needs to run under Linux.)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program files (HTML is a special case), builds abstract syntax trees representing the essential structure of each file, and compares programs for similarity.
Because it is comparing essential program structure, it ignores inessential differences such as comments and whitespace, and determines that two code segments are either identical or one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if a file a is a variant of file b (e.g., with a few changes), CloneDR will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
It doesn't run under Linux, but I assume your problem is hard enough to solve that this isn't what you are optimizing for.
I use WinMerge a lot on Windows, and from what I can see some people like Meld on Linux, so perhaps that could do the trick for you:
http://meld.sourceforge.net/
Other examples I found from a quick search were Kompare, xxdiff.sourceforge.net, and kdiff3.sourceforge.net
(I could only post one link, so I wrote the addresses of xxdiff and kdiff3 as text.)
Beyond Compare is purchased software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI based but handles thousands of files very well. It will allow you to specify unimportant changes with regular expressions as well as whitespace (beginning, middle and end of line). The feature set is very extensive; check out a trial download.
I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!
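If none of the tools fit, the filtering the question would rather avoid can at least be kept small: normalize each page, then compare the normalized strings. The sketch below is only an illustration. The timestamp format is an assumption, the function names are made up, and stripping whitespace between tags can hide differences that matter for inline elements.

```python
import re

# Assumed timestamp shape, e.g. 2009-01-02 03:04:05 or 2009-01-02T03:04:05.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def normalize(html):
    """Collapse differences that should not count as changes."""
    html = TIMESTAMP.sub("<TIMESTAMP>", html)  # mask timestamps
    html = re.sub(r">\s+<", "><", html)         # whitespace between tags
    html = re.sub(r"\s+", " ", html)            # runs of whitespace and newlines
    return html.strip()


def same_page(old_html, new_html):
    """True when two pages are equal after normalization."""
    return normalize(old_html) == normalize(new_html)
```

Pages for which same_page returns False are the only ones that then need a real diff tool, which should cut tens of thousands of files down to a manageable set.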

iPhone: RegexKit vs. RegexKit lite - going through an array takes very very long

For my app I need to see if a URL is matched by a regex string, so I created an array with all the regex strings (about 1000+ strings) and check them using RegexKitLite:
for (NSString * aString in mainDelegate.whiteListArray) {
if (![urlString isMatchedByRegex:aString]) {
It works, but sadly this operation takes a very long time: at least 20 seconds for a webpage like google.com.
I've tried using the "normal" RegexKit.framework, because it has a method called (BOOL)isMatchedByAnyRegexInArray:(NSArray *)regexArray which is much faster. I can build the app, but whenever I try to launch it, it crashes with the following error:
dyld: Library not loaded: @executable_path/../Frameworks/RegexKit.framework/Versions/A/RegexKit
Referenced from: /Users/Reilly/Library/Application Support/iPhone Simulator/User/Applications/7E057EA8-5CD1-465B-8102-38A53A9B5F5B/Drowser.app/Drowser
Reason: image not found
I guess it's because RegexKit is not meant for ARM? (To include RegexKit I followed the how-to that comes with the documentation.)
So my questions are:
Do you know of any faster way to check whether a string is matched by any of 1000 regexes?
Or do you know how to use the "normal" RegexKit on iPhone, or any other regex framework which would do what I need in under a second?
Thanks in advance.
Note: I am the author of RegexKit et al.
This is a fairly complicated answer.. :)
First, matching a thousand regexes with any of the commonly available regex engine implementations is going to be fairly slow, save for perhaps the TCL and TRE regex engines. RegexKit.framework greatly outperforms RegexKitLite for this task because it has quite a bit of non-trivial, optimized code for just this purpose. That code exists because it's used in Safari AdBlock, which needs to perform bulk matches of regexes against URLs. It keeps the list of regexes in sorted order, based on the number of times they made a successful match. This is based on the observation that some regex patterns used in Safari AdBlock match much more frequently than others, and trying those first dramatically reduces the number of regexes that need to be tried to determine if there's a 'hit'. There is also a small negative hit cache, along with a lot of multithreading code to do the matches in parallel. None of this will ever make it into the Lite version, as it is definitely not a lightweight feature: there's probably 60-70KB of code just to implement this one feature, not to mention the huge memory footprint of keeping a thousand compiled regexes around.
Using RegexKitLite to do this kind of pattern matching is bound to be very, very slow. The first problem is that it only keeps a small cache of compiled regexes that have recently been used. By default, the cache is set to just 23, so tossing a thousand regexes at it is going to cause every regex to be compiled each time it's used.
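The hit-count ordering described above can be sketched in a few lines. This is written in Python purely for illustration and is not RegexKit's actual code; the class and method names are made up, and the negative cache and multithreading are left out.

```python
import re


class HitOrderedMatcher:
    """Try the patterns that matched most often first."""

    def __init__(self, patterns):
        # Compile every pattern once and keep it; each entry is [regex, hit_count].
        self._entries = [[re.compile(p), 0] for p in patterns]

    def matches_any(self, text):
        for entry in self._entries:
            if entry[0].search(text):
                entry[1] += 1
                # Stable sort: frequently hit patterns move to the front.
                self._entries.sort(key=lambda e: e[1], reverse=True)
                return True
        return False

    def order(self):
        """Current pattern order, most frequently matched first."""
        return [entry[0].pattern for entry in self._entries]
```

Note that every pattern stays compiled for the lifetime of the matcher, which also sidesteps recompiling each regex on every use, at the cost of memory.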
As others have pointed out, RegexKit.framework isn't really set up to be used on the iPhone. Even if you got around the "linking to external frameworks" provision, the default build of RegexKit.framework does not include the arm architecture in its fat binary (it includes ppc, ppc64, i386, and x86_64). What you really need to do is set up a new build target that creates a static library. Not terribly hard to do, really.
I'm afraid that if this kind of pattern matching is something you need to do, you're probably going to have to roll your own regex engine. What you need is a regex engine that can take your thousand regexes and concatenate them together, such as "r1|r2|r3|r4". Most regex engines, and in particular PCRE and ICU (the ones used by RegexKit.framework and RegexKitLite, respectively), evaluate such a regex in an almost left-to-right manner. What's needed is an almost DFA-like engine that evaluates all possible states concurrently. See this link for more information. I've built such a regex engine, one that even handles back-references (much easier to do than everyone says) in ~O(M*log2(N)) time (M being the size of the text to match, N being the size of the regex), but it's not finished. If it were, it would cut through this kind of problem like a plasma torch through butter.
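The concatenation step itself is simple, as this Python sketch shows. The two whitelist patterns are invented examples. Python's re module is also a backtracking engine, so this shows only how to build "r1|r2|r3", not the DFA-like evaluation described above; numbered back-references inside the individual patterns would also need renumbering.

```python
import re


def combine(patterns):
    """Join patterns into one alternation; each is wrapped so its own '|' stays local."""
    return re.compile("|".join("(?:" + p + ")" for p in patterns))


# Hypothetical whitelist entries, for illustration only.
whitelist = combine([
    r"^https?://(www\.)?google\.com/",
    r"^https?://[^/]+\.example\.org/",
])
```

A single call such as whitelist.search(url) then replaces the loop over the whole array, so at least the per-pattern call overhead and the regex cache misses go away.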
I am aware of at least one person porting RegexKit.framework to the iPhone, though: Mobile Safari AdBlock. AFAIK, it's also a port of the desktop version of Safari AdBlock. I don't know many details, but I think it requires a jail-broken iPhone to install.
In summary, I don't think there are any turn-key solutions available for iPhone development that do anything close to what you need. Your best bet, other than creating your own regex engine, is to look into the TRE regex engine and try some experiments using concatenated regexes. Be prepared to roll up your sleeves, though, as you're going to have to get your hands very dirty and deal with the guts of Cocoa's strings, Unicode encodings, and all kinds of other unpleasant stuff, the kind of stuff that RegexKitLite takes care of for you behind the scenes.
Are you copying RegexKit.framework into the frameworks folder of your iPhone app?
iPhone does not support embedded frameworks, so the directions there would not work even if it was built for arm. You can only use things that are statically linked, so you will either need to modify RegexKit to build as a static library, or include its source code directly in your project.