How can I get the text in the Perl/Tk Text widget? - perl

I have written a script that takes a file name and inserts the file's content into a Text widget. Now, when I close the script window, I need it to write the text to the Unix screen.
How can I get the Text widget content?
My Text widget insertion source code is:
open(FILE, '<', $file_name) or die "Cannot open $file_name: $!";
foreach my $line (<FILE>) {
$text->insert('end', $line);
}

$text->get('1.0','end-1c');
(Use end-1c, that is, end less one character: the Text widget always keeps a newline after the last character, so with just end you'd get an extra newline appended. A known Tk gotcha.)

Related

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated with a plain old double space (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
my $inputFile = $_[0]; #read first argument from the commandline as fileName
open INPUTFILE, "+<", $inputFile or die $!; #Open File
my $singleLine;
my @singleRecord;
my $recordCounter = 0;
while (<INPUTFILE>) { # loop through the input file line-by-line
$singleLine = $_;
push(@singleRecord, $singleLine); # start adding each line to a record array
if ($singleLine =~ m/\r\n/) { # check for carriage return and new line
$recordCounter += 1;
createHashTable(@singleRecord); # send record make a hash table
@singleRecord = (); # empty the current record to start a new record
}
}
print "total records : $recordCounter \n";
close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
my ($input_file) = @_;
open my $fh, '<:crlf', $input_file or die $!;
local $/ = '';
my $record_counter = 0;
while (my $record = <$fh>) {
chomp $record;
++$record_counter;
create_hash_table(split /\n/, $record);
}
close $fh;
print "Total records : $record_counter\n";
}
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
If your files are not huge multi-gigabyte files, the easiest and safest way is to read the whole file and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.
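As a rough illustration of the \R approach above, here is a minimal sketch. The count_records helper and the sample strings are made up for this example, and it assumes Perl 5.10 or later (for \R) and records separated by exactly one blank line.

```perl
use strict;
use warnings;

# Hypothetical helper: count records separated by one blank line,
# whatever the line-ending style of the input.
sub count_records {
    my ($content) = @_;
    $content =~ s/\R+\z//;               # ignore line endings at end of file
    return 0 if $content eq '';
    my @records = split /\R\R/, $content;
    return scalar @records;
}

# The same two records with Windows, Unix and old Mac line endings.
my %samples = (
    CRLF => "record 1\r\nsome text\r\n\r\nrecord 2\r\nsome text\r\n",
    LF   => "record 1\nsome text\n\nrecord 2\nsome text\n",
    CR   => "record 1\rsome text\r\rrecord 2\rsome text\r",
);

for my $style (sort keys %samples) {
    printf "%-4s: %d records\n", $style, count_records($samples{$style});
}
```

Splitting counts the records themselves rather than the separators between them, so the result does not depend on whether the file ends with a newline.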

extracting paragraphs from text with perl

I want to extract the paragraphs from a text variable that was retrieved from the DB.
To extract paragraphs from a filehandle I use the code below:
local $/ = undef;
@paragraphs = <STDIN>;
What is the best way to extract paragraphs from a text variable in Perl, and is there a module on CPAN that does this kind of task?
You're almost there. Setting $/ to undef will slurp in the entire text in one go.
What you want is local $/ = ""; to enable paragraph mode, as per perldoc perlvar (emphasis my own):
$/
The input record separator, newline by default. This influences Perl's
idea of what a "line" is. Works like awk's RS variable, including
treating empty lines as a terminator if set to the null string (an
empty line cannot contain any spaces or tabs). You may set it to a
multi-character string to match a multi-character terminator, or to
undef to read through the end of file. Setting it to "\n\n" means
something slightly different than setting to "" , if the file contains
consecutive empty lines. Setting to "" will treat two or more
consecutive empty lines as a single empty line. Setting to "\n\n"
will blindly assume that the next input character belongs to the next
paragraph, even if it's a newline.
Of course, it is possible to get a filehandle to read from a string instead of a file:
use strict;
use warnings;
use autodie;
my $text = <<TEXT;
This is a paragraph.

Here's another one that
spans over multiple lines.

Last paragraph
TEXT
local $/ = "";
open my $fh, '<', \$text;
while ( <$fh> ) {
print "New Paragraph: $_";
}
close $fh;
Output
New Paragraph: This is a paragraph.

New Paragraph: Here's another one that
spans over multiple lines.

New Paragraph: Last paragraph
You already have the answer for a script (local $/ = "";), but it may be worth noting that there is a shortcut for one-liners: the -00 option.
perl -00 -ne '$count++; END {print "Counted $count paragraphs\n"}' somefile.txt
From man perlrun :
-0[octal/hexadecimal]
specifies the input record separator ($/) [...]
The special value 00 will cause Perl to slurp files in paragraph
mode.
If the text is in a variable, for example:
$text = "Here is a paragraph.\nHere is another paragraph.";
or:
$text = 'Paragraph 1
Paragraph2';
You can simply get the paragraphs by splitting the text with "\n".
@paragraphs = split("\n",$text);
If your paragraphs are separated by double newlines or a combination of \n and \r (like in Windows) you can change the split command accordingly.
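A minimal sketch of such an adjusted split, assuming paragraphs are separated by one or more blank lines and that line endings may be either \n or \r\n. The sample text is made up for illustration.

```perl
use strict;
use warnings;

my $text = "Paragraph 1\r\nstill paragraph 1\r\n\r\nParagraph 2\n\n\nParagraph 3\n";

# Two or more consecutive line endings (LF or CRLF) end a paragraph.
my @paragraphs = split /(?:\r?\n){2,}/, $text;

# Remove the single line ending left on the final paragraph.
s/\r?\n\z// for @paragraphs;

print scalar(@paragraphs), " paragraphs\n";
print "[$_]\n" for @paragraphs;
```

Line breaks inside a paragraph are left untouched; only runs of two or more line endings act as separators.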

Compare two UTF-8 text files and ignore lines that are blank or all whitespace

I am an author maintaining Kindle (HTML) and Open Office versions of a book. I sometimes forget to make changes to one or the other, and the documents are diverging.
My procedure is to copy the text from each and paste into separate text files (using paste and match style in TextEdit) in UTF-8, then perform a differencing operation. However the HTML paste adds blank lines between paragraphs.
I have a file differencing tool, but it has no option to ignore blank lines. My thought was to write a Perl script to remove the blank lines. However, the output of that script screws up the special characters, like en dashes, curly quotes, etc. I have tried using binmode and other tricks, to no avail.
I will accept a pointer to a free comparator for Mac OS X that ignores blank lines, or a way to get Perl to not screw up the UTF-8 special characters. I am using Perl 5.14. I prefer answers that do not rely upon newer features, but if I have to install a new Perl, I will.
UPDATE:
This does not work:
use open IO => ":encoding(iso-8859-7)";
open(FILE, "From HTML.txt") or die "$!\n";
open(OUT, ">From HTML - no blank lines.txt") or die "$!\n";
while(<FILE>) {
next if /^\s*$/;
print OUT $_;
}
close FILE; close OUT;
I also tried calling binmode(OUT, ":utf8");
UPDATE: Tried without success this tip from another Stack Overflow question:
open(my $fh, "<:encoding(UTF-8)", "filename");
GNU diff has -B/--ignore-blank-lines and -b/--ignore-space-change.
Err, that "use open" says that your data is not UTF-8. Try binmode on both FILE and OUT?
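One way to follow that suggestion is to declare UTF-8 on both handles when opening them, instead of the iso-8859-7 default from the question. The sketch below is an illustration under assumptions, not the asker's final solution: strip_blank_lines is a hypothetical helper, and it presumes both files really are UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path, dropping lines that
# are empty or contain only whitespace. Both files are treated as UTF-8.
sub strip_blank_lines {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:encoding(UTF-8)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        next if $line =~ /^\s*$/;    # skip blank and whitespace-only lines
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
}

# Run only when both file names are given on the command line.
strip_blank_lines(@ARGV) if @ARGV == 2;
```

Saved under a name of your choosing, it would be called with the input and output file names as its two arguments.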
I ended up using the XCode text editor. By selecting a newline and pasting it into the search/replace dialog, I was able to replace all double newlines with single newlines.
Then I saved the file and used my Compare utility.

Perl format() issue - Empty lines between paragraphs get stripped and cannot be displayed

I'm using Perl's format and write functions to output some text.
The requirements are below:
Print an article (length unknown) using Perl format.
Maximum 80 characters per line.
Last word should be wrapped to the next line if there is not enough space.
Empty lines between paragraphs need to be retained.
The problem I'm having now is any blank lines between paragraphs cannot be displayed. I checked, and this seems to be caused by the use of "~~".
The format is defined as below.
format FULL_TEXT =
Full Story:
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$storyBody
.
Is there a way to print this empty line between paragraphs while still meeting other requirements?
For example, below is what I expect. However, as I said before, the empty line between the two paragraphs is stripped and cannot be displayed.
COLLINGWOOD unfurled its 2010 premiership flag at the MCG last night and marked
the occasion as protocol demanded, by lowering the colours of its
longeststanding rival, Carlton in a contest that was epic in style, if not
consequence.
The crowd was 88,181, a record for home-and-away contests between these clubs. An old feeling stirring in the AFL.
The trick is to split the text into paragraphs and write one paragraph at a time.
use strict;
use warnings;
# slurp text
my $text = do { local $/; <> };
# split into paragraphs
my @paragraphs = split /\n+/, $text;
# define format, including newline at the end
format STDOUT =
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$_
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ~~
$_
.
# write text to format
write for @paragraphs;
Call it like this:
perl /tmp/fmt.pl < /tmp/article.txt
If you want to or have to save memory because your articles are so big, you can combine the first two steps:
use strict;
use warnings;
# slurp text into paragraphs
my @paragraphs = split /\n+/, do { local $/; <> };
# define format, including newline at the end
format STDOUT =
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$_
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< ~~
$_
.
write for @paragraphs; # write text to format
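If format is not a hard requirement, the same paragraph-by-paragraph idea can be sketched with the core Text::Wrap module. This swaps in a different technique from the answer above. wrap_article and the sample text are hypothetical, and the sketch assumes, as the split /\n+/ above does, that each paragraph arrives as one long line.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: wrap each paragraph to at most $width characters
# and put one empty line between paragraphs.
sub wrap_article {
    my ($text, $width) = @_;
    # Text::Wrap keeps every line shorter than $columns, hence the + 1.
    local $Text::Wrap::columns = $width + 1;
    my @paragraphs = grep { /\S/ } split /\n+/, $text;
    return join("\n\n", map { wrap('', '', $_) } @paragraphs) . "\n";
}

my $article = join "\n",
    'The first paragraph is one long line of text that should be wrapped ' .
    'at word boundaries once it no longer fits on a single output line.',
    '',
    'The second paragraph follows after an empty line.';

print wrap_article($article, 80);
```

Because the paragraphs are joined with an explicit empty line, the separator survives regardless of how many blank lines the input had.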
use 5;
use strictures;
use Perl6::Form;
my $storyBody = 'COLLINGWOOD unfurled its 2010 premiership flag at the MCG last night and marked the occasion as protocol demanded, by lowering the colours of its longeststanding rival, Carlton in a contest that was epic in style, if not consequence.
The crowd was 88,181, a record for home-and-away contests between these clubs. An old feeling stirring in the AFL.';
my $form = form
'Full Story:',
'{[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[}',
map {s/\n/\r\r/; $_} $storyBody;
print $form;
Output:
Full Story:
COLLINGWOOD unfurled its 2010 premiership flag at the MCG last night and marked
the occasion as protocol demanded, by lowering the colours of its longeststanding
rival, Carlton in a contest that was epic in style, if not consequence.
The crowd was 88,181, a record for home-and-away contests between these clubs. An
old feeling stirring in the AFL.
Semantics of \r in form(?:at)?s

How to output one screen at a time in Perl

When executing a Perl script from the command line, how can I ensure that my output doesn't scroll off the screen?
In other words, how do I mimic the functionality of the Unix more or less commands?
The Term::Pager module would seem to be what you're looking for.
The user can just pipe the output to less. That gives them the option of using their favourite pager, or even not using any pager at all, if they prefer that.
As Matti Virkkunen says, it's better that the user pipes your script to less.
A user of a Unix-like system expects output in plain text, so that they can pipe it to other commands if they need to. If your script does not emit plain text, your users may find it less usable.
For a quick-and-dirty way, you can pipe the text to less or more:
my $text = <<'EOD';
Lots
and
lots
of
text
EOD
my $pager = $ENV{PAGER} || 'less';
open(my $less, '|-', $pager, '-e') || die "Cannot pipe to $pager: $!";
print $less $text;
close($less);
There are various less and more flags to allow the script to continue when it reaches the bottom of the text.
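Building on the point that users expect to be able to pipe plain text onward, a script can page only when its output goes to a terminal. A minimal sketch, assuming less is an acceptable default pager; output_handle is a hypothetical helper, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: return a handle to print to. A pager is used when
# STDOUT is a terminal; otherwise plain STDOUT, so pipes keep working.
sub output_handle {
    return \*STDOUT unless -t STDOUT;
    my $pager = $ENV{PAGER} || 'less';
    open(my $fh, '|-', $pager) or return \*STDOUT;   # fall back if the pipe cannot be opened
    return $fh;
}

my $out = output_handle();
print {$out} "line $_\n" for 1 .. 5;

# Closing a pager handle waits for the pager to exit; leave STDOUT open.
close $out unless fileno($out) == fileno(STDOUT);
```

With this arrangement, redirecting or piping the script's output bypasses the pager entirely.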