How can I wrap lines to 45 characters in a Perl program? - perl

I have this text I am writing in a Perl CGI program:
$text = $message;
@lines = split(/\n/, $text);
$lCnt .= $#lines+1;
$lineStart = 80;
$lineHeight = 24;
I want to force a return after 45 characters. How do I do that here?
Thanks in advance for your help.

Look at the core Text::Wrap module:
use Text::Wrap;
my $longstring = "this is a long string that I want to wrap it goes on forever and ever and ever and ever and ever";
$Text::Wrap::columns = 45;
print wrap('', '', $longstring) . "\n";

Check out Text::Wrap. It will do exactly what you need.
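A minimal sketch of how this could slot into the code from the question, assuming $message holds the text to wrap (the sample string here is only a stand-in). One detail worth knowing: the Text::Wrap documentation says every output line is at most $columns - 1 characters long, so a setting of 45 yields lines of up to 44 characters. If lines of exactly up to 45 characters are wanted, 46 is the value to use.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical stand-in for the $message variable from the question.
my $message = "this is a long string that I want to wrap it goes on "
            . "forever and ever and ever and ever and ever";

# Text::Wrap keeps each line to at most $columns - 1 characters,
# so 46 allows lines of up to 45 characters.
local $Text::Wrap::columns = 46;

my $wrapped = wrap('', '', $message);

# Split the wrapped text back into lines, as the question's code does.
my @lines = split /\n/, $wrapped;
my $lCnt  = scalar @lines;

print "$_\n" for @lines;
print "line count: $lCnt\n";
```

Counting with scalar @lines gives the same number as $#lines + 1 in the question, but reads more directly.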

Since Text::Wrap for some reason doesn't work for the OP, here is a solution using a regex:
my $longstring = "lots of text to wrap, and some more text, and more "
. "still. thats right, even more. lots of text to wrap, "
. "and some more text.";
my $wrap_at = 45;
(my $wrapped = $longstring) =~ s/(.{0,$wrap_at}(?:\s|$))/$1\n/g;
print $wrapped;
which prints:
lots of text to wrap, and some more text, and
more still. thats right, even more. lots of
text to wrap, and some more text.

The Unicode::LineBreak module can do more sophisticated wrapping of non-English text (especially East Asian scripts) than Text::Wrap, and it has some nice features, such as optionally recognizing URIs and avoiding splitting them.
Example:
#!/usr/bin/env perl
use warnings;
use strict;
use Unicode::LineBreak;
my $longstring = "lots of text to wrap, and some more text, and more "
. "still. thats right, even more. lots of text to wrap, "
. "and some more text.";
my $wrapper = Unicode::LineBreak->new(ColMax => 45, Format => "NEWLINE");
print for $wrapper->break($longstring);

Related

How do I select text between two specific symbols in Perl?

I'm very new to Perl. I'm currently going through a Perl file, and I have a variable that I was able to trim down to all the text from the "<" symbol onward, using this line I found in another Stack Overflow question.
($tempVariable) = $Line =~ /(\<.*)\s*$/;
So currently whenever I print this variable, I get the output
$tempVariable = <some text here #typeOf and more text here after
I need to get everything between the "<" symbol and the "#" symbol.
I tried looking at other Stack Overflow questions and adapting their answers to my case, but I keep getting errors, so if anybody could help me out I would appreciate it.
my ($substr) = $str =~ /<([^<\#]*)\#/
or die "No match";
You'll need a regex that
looks for the starting < character
then (your question is unclear on this point)
captures one-or-more non-# characters, or
captures zero-or-more non-# characters
looks for the trailing # character
also not specified in your question: do you need to strip leading and trailing white space from the match?
I.e.
#!/usr/bin/perl
use warnings;
use strict;
my $Line = '<some text here #typeOf and more text here after';
my $tempVariable;
# alternative 1: one-or-more characters
($tempVariable) = $Line =~ /<([^#]+)#/
or die "No match alternative 1";
print "Alternative 1: '${tempVariable}'\n";
# alternative 2: zero-or-more characters
($tempVariable) = $Line =~ /<([^#]*)#/
or die "No match alternative 2";
print "Alternative 2: '${tempVariable}'\n";
exit 0;
Test run (white space is not stripped):
$ perl dummy.pl
Alternative 1: 'some text here '
Alternative 2: 'some text here '
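If the surrounding white space should be stripped as well, one possible variation (a sketch, not part of the answer above) is to move the white space outside the capture group and make the capture lazy. The variable name $trimmed is only illustrative.

```perl
use strict;
use warnings;

my $Line = '<some text here #typeOf and more text here after';

# \s* outside the parentheses absorbs leading and trailing white space;
# the lazy quantifier *? keeps trailing spaces out of the capture.
my ($trimmed) = $Line =~ /<\s*([^#]*?)\s*#/
    or die "No match";

print "Trimmed: '$trimmed'\n";
```

Like alternative 2, this version still matches when nothing but white space sits between the two symbols, in which case the capture is an empty string.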

Text::SpellChecker module and Unicode

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );
while ( my $word = $checker->next_word ) {
print "Bad word is $word\n";
}
Output: Bad word is rdinator
Desired: Bad word is coördinator
The module is breaking if I have Unicode in $text. Any idea how can this be solved?
I have Aspell 0.50.5 installed which is being used by this module. I think this might be the culprit.
Edit: As Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell and installed Hunspell, Text::Hunspell, then:
$ hunspell -d en_US -l < badword.txt
coördinator
Shows correct result. This means there's something wrong either with my code or Text::SpellChecker.
Taking Miller's suggestion in consideration I did the below
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text = "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
print "Bad word is $word\n";
}
OUTPUT:
Flag is 1
Text is coördinator
Bad word is rdinator
Does this mean the module is not able to handle utf8 characters properly?
It is a Text::SpellChecker bug: the current version assumes ASCII-only words.
http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm
#
# next_word
#
# Get the next misspelled word.
# Returns false if there are no more.
#
sub next_word {
...
while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {
IMHO the best fix would be to use a per-language/locale word-splitting regular expression, or to leave word splitting to the underlying library. aspell list reports coördinator as a single word.
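To illustrate why the ASCII-only character class splits the word, here is a small sketch that compares the pattern from the excerpt with a variant built on the Unicode letter property \p{L}. The \p{L} variant is an assumption made for illustration only; it is not necessarily what any released version of the module does.

```perl
use strict;
use warnings;

# "co\x{f6}rdinator" is "coördinator" written with an escape, so the
# example does not depend on the encoding of this source file.
my $text = "co\x{f6}rdinator";

# Store the string internally as UTF-8 so that Unicode matching
# rules apply on older Perls too.
utf8::upgrade($text);

# ASCII-only pattern, as in the next_word excerpt.
my @ascii_words = $text =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g;

# Assumed alternative: \p{L} matches any Unicode letter.
my @unicode_words = $text =~ m/(\p{L}+(?:'\p{L}+)?)/g;

binmode STDOUT, ':encoding(UTF-8)';
print "ASCII pattern:   @ascii_words\n";
print "Unicode pattern: @unicode_words\n";
```

The ASCII pattern stops at the ö and resumes after it, which is consistent with the module reporting only "rdinator".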
I've incorporated Chankey's solution and released version 0.12 to the CPAN, give it a try.
The validity of diaeresis in words like coördinator is interesting. The default aspell and hunspell dictionaries seem to mark it as incorrect, though some publications may disagree.
best,
Brian

Strip paragraph tags from a block of text

I am just trying to get my head round preg_replace, but it is driving me round the bend. All the examples I seem to find are a million times more complicated than I need.
What I'm trying to do is just strip the paragraph tags from a block of text.
so.......
$text = '<p>Some block of text</p>';
should become
$afterreplace = 'Some block of text';
so I'm wondering how on earth I do this with preg_replace($pattern, $replacement, $string);
I kind of get this far, but then I'm not sure how to tell it to strip paragraphs........
preg_replace($pattern, $replacement, $text);
For this specific example I would just use str_replace. It is way faster.
You just replace "<p>" and "</p>" with "" (an empty string).
This worked in the end >>
$text = '<p>Some Long Text String</p>';
$replaced = str_replace(array('<p>','</p>'),array('',''),$text);

Difficulty with Logic parsing ICAL feed

I have been working on code that will parse event information from an iCal feed. It is a huge block of data that I want to divide by key term, and I need it to be done in an orderly way. I tried indexing the key terms and then having the program print what is between those indexes. However, for some reason it became an infinite loop that printed all the data, and I don't know how to fix it. DO NOT RUN MY CODE, IT KEEPS FREEZING MY COMPUTER. I was hoping someone could show me what my problem is.
DO NOT RUN THIS PROGRAM
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;
my $URL= get("https://www.events.utoronto.ca/iCal.php?ical=1&campus=0&
+sponsor%5B%5D=&audience%5B%5D=&category%5B%5D=");
my $Format=HTML::FormatText->new;
my $TreeBuilder=HTML::TreeBuilder->new;
$TreeBuilder->parse($URL);
my $Parsed=$Format->format($TreeBuilder);
open(FILE, ">UOTSUMMER.txt");
print FILE "$Parsed";
close (FILE);
open (FILE, "UOTSUMMER.txt");
my @array=<FILE>;
my $string ="@array";
my $offset = 0; # Where are we in the string?
my $numResults = 0;
while (1) {
my $idxSummary = index($string, "SUMMARY", $offset);
my $result = "";
my $idxDescription = index ($string, "DESCRIPTION", $offset);
my $result2= "";
if ($idxSummary > -1) {
$offset = $idxSummary + length("SUMMARY");
my $idxDescription = index($string, "DESCRIPTION", $offset);
if ($idxDescription == -1) {
print "(Data malformed: missing DESCRIPTION line.)\n";
last;
}
if ($idxDescription > -1) {
$offset = $idxDescription+ length("DESCRIPTION");
my $idxLocation= index($string, "LOCATION", $offset);
if ($idxLocation == -1) {
print "(Data malformed: missing LOCATION line.)\n";
last;
}
my $length = $idxDescription - $offset;
my $length2= $idxLocation - $offset;
$result = substr($string, $offset, $length);
$result2= substr ($string, $offset, $length2);
$offset = $idxDescription + length("DESCRIPTION");
$result =~ s/^\s+|\s+$//g ; # Strip leading and trailing white space, including newlines.
$result2 =~ s/^\s+|\s+$//g ;
$numResults++;
} else {
print "(All done. $numResults result(s) found.)\n";
last;
}
open (FILE2, "UOT123.txt")
print FILE2 "TITLE: <$result>\n DESCRIPTION: <$result2>\n";
Any guidance you may have will be greatly appreciated! Thanks!
I was so inspired by your warnings that I had to run it. I even installed the required modules to do so. Your computer is probably just getting bogged down by the endless loop, not really crashing.
Looking at your code, the problem is almost certainly your indexing. As it stands now, your looping logic is kind of a mess. Your best bet would be to rethink how you are doing this. Rather than using all of this logic, try making the loop dependent on going through the file. That way, it will be much harder to make an endless loop. Also, regular expressions will make this job much simpler. This probably doesn't do exactly what you want, but it is a start:
while ($string =~ m/SUMMARY(.+?)DESCRIPTION(.+?)(?=SUMMARY|$)/gcs)
{
print "summary is: \n\n $1 \n\n description is: \n\n $2 \n\n";
}
Some other quick points:
Writing to a file and then opening it and reading the content back out again at the beginning doesn't make much sense. You already have exactly what you want in $Parsed.
If you just want to print a variable by itself, don't put it in quotes. The quotes only force an unnecessary string interpolation.
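As a self-contained illustration of letting the regex drive the loop, here is a sketch that runs against a small made-up fragment instead of the live feed. The sample data, the trailing colons in the pattern, and the extra LOCATION: stop condition are assumptions added for this example.

```perl
use strict;
use warnings;

# Made-up stand-in for the downloaded feed text.
my $string = "SUMMARY:First event\n"
           . "DESCRIPTION:About the first\n"
           . "LOCATION:Room 1\n"
           . "SUMMARY:Second event\n"
           . "DESCRIPTION:About the second\n";

my @events;

# Each successful match resumes where the previous one stopped (/g),
# so the loop ends on its own once no further SUMMARY is found.
while ($string =~ m/SUMMARY:(.+?)\s*DESCRIPTION:(.+?)\s*(?=LOCATION:|SUMMARY:|\z)/gs) {
    push @events, { summary => $1, description => $2 };
}

for my $event (@events) {
    print "TITLE: <$event->{summary}>\n";
    print "DESCRIPTION: <$event->{description}>\n";
}
```

Because the match position only ever moves forward, there is no offset bookkeeping to get wrong, which is what makes an endless loop much less likely here.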
Perhaps the following will assist you with your parsing task:
use Modern::Perl;
use LWP::Simple qw/get/;
use HTML::Entities;
my $html = get 'https://www.events.utoronto.ca/iCal.php?ical=1&campus=0&+sponsor%5B%5D=&audience%5B%5D=&category%5B%5D=';
while ( $html =~ /(Summary:\s*[^\n]+)\s*(Description:\s*[^\n]+)/gi ) {
say decode_entities($1) . "\n" . decode_entities($2);
}
Sample Output:
SUMMARY:Learning Disabilities Parent Support Group
DESCRIPTION: Dates: Thursdays: May 24, June 21, and July 19
SUMMARY:"Reading to Write"
DESCRIPTION: Leora Freedman, Coordinator, English Language Learning Program, Faculty of Arts & Science
SUMMARY:The Irish Home Rule Bill of 1912: A Centennial Symposium
DESCRIPTION: One-day symposium presented by the Celtic Studies Program, St. Michael's College
If html entities are OK within the text, you can omit using HTML::Entities and the decode_entities($1) notation, else you may get results like the following:
DESCRIPTION: Leora Freedman, Coordinator, English Language Learning Program, Faculty of Arts &amp; Science
Hope this helps!

How do I count the "real" words in a text with Perl?

I've run into a text processing problem. I've an article, and I'd like to find out how many "real" words there are.
Here is what I mean by "real". Articles usually contain various punctuation marks such as dashes, and commas, dots, etc. What I'd like to find out is how many words there are, skipping like "-" dashes and "," commas with spaces, etc.
I tried doing this:
my @words = split ' ', $article;
print scalar @words, "\n";
But that includes various punctuations that have spaces in them as words.
So I'm thinking of using this:
my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";
This would match all words that have characters or numbers in them. What do you think, would this be good enough way to count words in an article?
Does anyone know maybe of a module on CPAN that does this?
Try to use: \W - any non-word character, and also drop _
Solution
use strict;
my $article = 'abdc, dd_ff, 11i-11, ff44';
# case David's, but it didn't work with I'm or There's
$article =~ s/\'//g;
my $number_words = scalar (split /[\W_]+/, $article);
print $number_words;
I think your solution is about as good as you're going to get without resorting to something elaborate.
You could also write it as
my @words = $article =~ /\S*\w\S*/g;
or count the words in a file by writing
use feature 'say';
my $n = 0;
while (<>) {
my @words = /\S*\w\S*/g;
$n += @words;
}
say "$n words found";
Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.
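One way to run that kind of check is to put the approaches side by side on a short sample. The sketch below uses a made-up sentence containing a free-standing dash; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Made-up sample text containing a free-standing dash.
my $article = "Well - it's done, isn't it? Yes: 3 times.";

# Plain white-space split: the lone "-" is counted as a token.
my @tokens = split ' ', $article;

# Keep only tokens that contain at least one word character.
my @words = grep { /\w/ } @tokens;

# The same idea expressed as a single global match.
my @matched = $article =~ /\S*\w\S*/g;

printf "%d tokens, %d words (grep), %d words (regex)\n",
    scalar @tokens, scalar @words, scalar @matched;
```

Note that \w also matches the underscore, so a token made only of underscores would still be counted by both filters.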