This question already has answers here:
How do I count the characters, words, and lines in a file, using Perl?
(10 answers)
Closed 4 years ago.
How do I find out the line count in Perl, similar to what wc -l gives me? Preferably a method that doesn't require reading the whole file.
This question is way too broad and unspecific. But as it is based on an often-seen misconception, I'd like to comment on that.
A file is a sequence of bytes with no notion of "lines." The lines that we see when a file is displayed are determined by a particular character (or a short sequence of characters) in the file, denoting a "linebreak" to be used by software that views or edits the file. They are not a property of a file that one can just look up as metadata.
So you have to read the whole file in order to determine how many "lines" it has.
This can be done with a language's native tools or by running an external utility that does it, like wc. I'd recommend doing it in Perl within your Perl program, since the job fits squarely within Perl's most common uses. There are a number of ways to do it, but we'd need to see your code in order to offer a specific recommendation.
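For reference, here is a minimal sketch of the in-Perl approach; the filename is just a placeholder:

use strict;
use warnings;

my $file = 'data.txt';    # placeholder filename
open my $fh, '<', $file or die "Can't open $file: $!";
my $count = 0;
$count++ while <$fh>;     # read line by line, counting as we go
close $fh;
print "$file has $count lines\n";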
Related
Perl 5 has the encoding pragma and the Filter::Encoding module; however, I have not found anything similar in Perl 6. I guess source filters will eventually be created, but for the time being, can you use other encodings in Perl 6 scripts?
You cannot write your Perl 6 script in anything except utf8. I don't think any other encoding will ever be allowed for scripts, as utf8 is basically the universal standard. Benefits like having no endianness issues and being backward compatible with ASCII are some of the reasons it has become the standard rather than utf16 or utf32.
Maybe there was a time when such a thing would have been useful, but today I do not see that being the case. All text editors in common use that I know of default to utf8, and having source files in multiple encodings makes it more difficult to share your Perl 6 programs with others. There are plenty of reasons to want to use other encodings external to Perl 6 (writing files, reading files, etc.), but I don't see adding source filters as a smart move.
Rakudo currently supports an --encoding= option, so you might in theory be able to write a script in a different character encoding and call it with perl6 --encoding=utf16 yourscript.p6. But in my experiments I haven't managed to get it working with anything except utf8, and even if it worked, specifying --encoding on the command line would be a big no-go for me.
So the operational answer is: currently no.
(And I don't think anybody else has asked for it yet...)
This question already has answers here:
Idiomatic batch processing of text in Emacs?
(3 answers)
Closed 8 years ago.
Imagine I have an input file, an output file, and a file containing some Elisp code which should transform the input file into the output file. Is there a way I could do all this from an external process? Maybe some kind of script mode for Emacs? I would like to embed this in a web application.
See emacs --batch in the Initial Options section of the manual. Use it with -l, -f, or --eval. The batch option forces prin1, princ, and print to print to stdout and message and error to print to stderr, so you can actually read and write to pipes.
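Since the question is about driving this from an external process, here is a minimal sketch of calling Emacs in batch mode from Perl; transform.el and transform-file are hypothetical names standing in for your own Elisp file and function:

use strict;
use warnings;

# transform.el and transform-file are placeholders for your own Elisp.
my @cmd = (
    'emacs', '--batch',
    '-l', 'transform.el',
    '--eval', '(transform-file "input.txt" "output.txt")',
);
system(@cmd) == 0 or die "emacs --batch failed: $?";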
Yes, it is possible. See emacs -l or emacs --eval.
This question already has answers here:
How do I implement dispatch tables in Perl?
Closed 13 years ago.
I have a hash table that contains commands such as int(rand()) etc.
How do I execute those commands?
You can use eval($str) to execute Perl code you store in a string variable, $str. You could alternatively store your code as function references within a hash, so something like:
$hash{'random'} = sub { int(rand()) };
This way, you could write $hash{'random'}->() to execute the function whenever you want a random value.
See also Implementing Dispatch Tables on PerlMonks.
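A slightly fuller sketch of the dispatch-table approach (the command names and bodies are just illustrations):

use strict;
use warnings;

my %dispatch = (
    random => sub { int(rand(100)) },    # illustrative commands
    stamp  => sub { scalar localtime },
);

my $cmd = 'random';                      # e.g. taken from user input
if (my $code = $dispatch{$cmd}) {
    print $code->(), "\n";
} else {
    warn "Unknown command: $cmd\n";
}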
As others have said, you can execute them using eval. However, please note that executing arbitrary strings of possibly tainted origin via eval is a major security hole, and it is also prone to be slow if the performance of your application matters.
You can use the Safe module to mitigate the security hole (I'm not sure how bulletproof it is, but it's much better than a naked eval), but the performance issue will always be there, as Perl has to compile your code at the moment it is executed, while the main program is already running.
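A minimal sketch of what using Safe looks like (the string being evaluated is just an example):

use strict;
use warnings;
use Safe;

my $str = 'int(rand(10)) + 1';       # example code string
my $compartment = Safe->new;
my $result = $compartment->reval($str);   # eval inside a restricted compartment
die "reval failed: $@" if $@;
print "$result\n";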
What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, so counting characters is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here's a link to some Linux word-to-text converters.
For example, you could use
antiword file.doc | wc
to do the counting.
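If you want this from a script with the failure case made catchable (as the question asks), a minimal Perl sketch along these lines might look like the following; antiword, the filename, and the naive word definition are all assumptions:

use strict;
use warnings;

my $file = 'file.doc';                        # hypothetical input file (no shell metacharacters)
my $text = qx(antiword $file 2>/dev/null);    # assumes antiword is installed
if ($? != 0 || !defined $text || $text eq '') {
    warn "Could not extract text from $file; no reliable count available\n";
} else {
    my @words = split ' ', $text;             # crude whitespace-based word count
    my $chars = length $text;
    print "words: ", scalar @words, "  characters: $chars\n";
}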
Edit:
This link shows that AbiWord has a command-line interface that you could use to convert the .docx format to .txt and then count the words using "wc". AbiWord does support the .docx format.
Mac OS X has support for reading Word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil), language:nil)
More portably — though it gives up support for docx — you could simply get Antiword and do antiword | wc -w.
Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
If you consider online applications an option, then yes, there is a solution.
This not-so-pretty (design-wise) site offers both word and character counts: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 99% or 100% accuracy, you could test it with a few short samples (e.g. 20-50 words) by counting them yourself first.
I hope this helps.
Regards. Chris
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 13 years ago.
What do people mean when they say "Perl is very good at parsing"?
How is Perl any better or more powerful than other scripting languages such as Python or Ruby?
They mean that Perl was originally designed for processing text files and has many features that make it easy:
Perl has many functions for string processing: substr, index, chomp, length, grep, sort, reverse, lc, ucfirst, ...
Perl automatically converts between numbers and strings depending on how a value is used (e.g. you can read the character string '100' from a file and add one to it without needing to do a string-to-integer conversion first; the sketch after this list shows this).
Perl automatically handles conversion between the platform's line-ending convention (e.g. CRLF on Windows) and the logical newline ("\n") within your program.
Regular expressions are integrated into the syntax instead of being a separate library.
Perl's regular expressions are the "gold standard" for power and functionality.
Perl has full Unicode support.
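As a small illustration of the automatic string/number conversion and the built-in regex syntax (the data lines here are made up):

use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    # Regexes are part of the syntax; capture a name and a numeric field.
    if ($line =~ /^(\w+)\s+(\d+)$/) {
        my ($name, $count) = ($1, $2);
        # $count was read as a string, but Perl converts it automatically.
        print "$name will have ", $count + 1, " next time\n";
    }
}

__DATA__
alice 100
bob 7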
Python and Ruby also have good facilities for text processing. (Ruby in particular took much inspiration from Perl, much as Perl has shamelessly borrowed from many other languages.) There's little point in asking which is better. Use what you like.
Don't take a statement of Perl's strengths to be a statement of another language's failings. Perl is good for text processing, but that doesn't mean Ruby or Python suck.
When people talk about Perl being "good for parsing", they're mainly echoing Perl's history; it was invented in the day when heavy-duty text processing wasn't easy. Try doing some of that in C or C++ (Java hadn't been invented yet, either!). Back in the day, Larry was trying to do his work with sed and awk, but running into their limitations. He made a tool that made text even easier to work with.
Perl is still very good for text manipulation tasks, but now so are a lot of other languages.
Perl is good for ETL or batch-processing tasks as well. It takes a minimal amount of code to pick up a file, push each record through split to get its fields, perform some business logic on the record, and write it back out to disk.
I suppose that's more data processing than data parsing, but data processing is bulk data parsing.
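A minimal sketch of that kind of batch pass (the field layout and file names are made up):

use strict;
use warnings;

# Hypothetical input: colon-separated records "name:quantity:price".
open my $in,  '<', 'orders.txt'     or die "orders.txt: $!";
open my $out, '>', 'orders_out.txt' or die "orders_out.txt: $!";

while (my $line = <$in>) {
    chomp $line;
    my ($name, $qty, $price) = split /:/, $line;
    my $total = $qty * $price;               # the "business logic" step
    print {$out} join(':', $name, $qty, $price, $total), "\n";
}

close $in;
close $out;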
Perl is very good at text parsing compared to C/C++/Java.
It's probably because people are familiar with what Perl was built for, as described in the Perl documentation, so it has become commonplace to associate parsing of text files with Perl. That's not to exclude Ruby or Python; Perl is just more of a household name for this, IMHO.
Perl is a language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).