Perl: break down a string, with some unique constraints - perl

I'm using Perl to feed data to an LCD display. The display is 8 characters wide. The strings of data to be displayed are always significantly longer than 8 characters. As such, I need to break the strings down into "frames" of 8 characters or less, and feed the "frames" to the display one at a time.
The display is not intelligent enough to do this on its own. The only convenience it offers is that strings of less than 8 characters are automatically centered on the display.
In the beginning, I simply sent the string 8 characters at a time - here goes 1-8, now 9-16, now 17-24, etc. But that wasn't especially nice-looking. I'd like to do something better, but I'm not sure how best to approach it.
These are the constraints I'd like to implement:
Fit as many words into a "frame" as possible
No starting/trailing space(s) in a "frame"
A symbol (e.g. hyphen, ampersand, etc.) with a space on both sides qualifies as a word
If a word is longer than 8 characters, simulate per-character scrolling
Break words longer than 8 characters at a slash or hyphen
Some hypothetical input strings, and desired output for each...
Electric Light Orchestra - Sweet Talkin' Woman
Electric
Light
Orchestr
rchestra
- Sweet
Talkin'
Woman
Quarterflash - Harden My Heart
Quarterf
uarterfl
arterfla
rterflas
terflash
- Harden
My Heart
Steve Miller Band - Fly Like An Eagle
Steve
Miller
Band -
Fly Like
An Eagle
Hall & Oates - Did It In A Minute
Hall &
Oates -
Did It
In A
Minute
Bachman-Turner Overdrive - You Ain't Seen Nothing Yet
Bachman-
Turner
Overdriv
verdrive
- You
Ain't
Seen
Nothing
Yet
Being a relative Perl newbie, I'm trying to picture how would be best to handle this. Certainly I could split the string into an array of individual words. From there, perhaps I could loop through the array, counting the letters in each subsequent word to build the 8-character "frames". Upon encountering a word longer than 8 characters, I could then repetitively call substr on that word (with offset +1 each time), creating the illusion of scrolling.
Is this a reasonable way to accomplish my goal? Or am I reinventing the wheel here? How would you do it?

The base question is to find all consecutive overlapping N-long substrings in a compact way.
Here it is in one pass with a regex, and see the end for doing it using substr.
my $str = join '', "a".."k"; # 'Quarterflash';
my @eights = $str =~ /(?=(.{8}))/g;
This uses a lookahead which also captures, and in this way the regex crawls up the string character by character, capturing the "next" eight each time.
While we're at it, here is also a basic solution for the problem. Add words to a buffer until the next word would push it past 8 characters, at which point the buffer is added to an array of display-ready strings and restarted with that word.
use warnings;
use strict;
use feature 'say';
my $str = shift // "Quarterflash - Harden My Heart";
my @words = split ' ', $str;
my @to_display;
my $buf = '';
foreach my $w (@words) {
    if (length $w > 8) {
        # Flush the buffer first (if it holds anything), then scroll the long word
        push @to_display, $buf if length $buf;
        $buf = '';
        push @to_display, $w =~ /(?=(.{8}))/g;
    }
    elsif (length $buf == 0) { $buf = $w }
    elsif ( length($buf) + 1 + length($w) > 8 ) {
        # The word doesn't fit: emit the buffer and start a new one
        push @to_display, $buf;
        $buf = $w;
    }
    else { $buf .= ' ' . $w }
}
push @to_display, $buf if length $buf;
say for @to_display;
This is clearly missing some special/edge cases, in particular those involving non-word characters and hyphenated words, but that shouldn't be too difficult to add.†
Here is a way to get all consecutive 8-long substrings using substr
my @to_display = map { substr $str, $_, 8 } 0..length($str)-8;
† Example: break a word at a hyphen/slash when it has no spaces around it (per the question)
my @parts = split m{\s+|(?<=\S)[-/](?=\S)}, $w;
The hyphen/slash is discarded as this stands; that can be changed by capturing the pattern as well and then filtering out elements with only spaces
my @parts = grep { /\S/ } split m{( \s+ | (?<=\S) [-/] (?=\S) )}x, $w;
These have been only barely tested. They can fit in the if (length $w > 8) branch.
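As a rough sketch of how the buffer loop, the scrolling window and the hyphen/slash break might fit together, here is one possible arrangement. The sub name frames and its $width parameter are made up for illustration, and keeping the hyphen or slash at the end of the left-hand piece is an assumption based on the "Bachman-" frame in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: break $str into display frames of at most $width chars.
sub frames {
    my ($str, $width) = @_;
    $width //= 8;
    my @frames;
    my $buf = '';
    for my $word (split ' ', $str) {
        # Over-long words are broken right after each hyphen or slash,
        # which stays attached to the piece on its left
        my @parts = length($word) > $width
                  ? split(m{(?<=[-/])}, $word)
                  : ($word);
        for my $part (@parts) {
            if (length($part) > $width) {
                # Flush what we have, then scroll one character per frame
                push @frames, $buf if length($buf);
                $buf = '';
                push @frames, map { substr $part, $_, $width }
                              0 .. length($part) - $width;
            }
            elsif (length($buf) == 0) { $buf = $part }
            elsif (length($buf) + 1 + length($part) > $width) {
                push @frames, $buf;
                $buf = $part;
            }
            else { $buf .= " $part" }
        }
    }
    push @frames, $buf if length($buf);
    return @frames;
}

print "$_\n" for frames("Bachman-Turner Overdrive - You Ain't Seen Nothing Yet");
```

For that input it prints the nine frames listed in the question. Hyphenated words that are long only because they have many short pieces would come out as separate space-joined pieces, so that case would need more thought.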
The initial take: the regex was originally written with a two-part pattern. I'm keeping it here for the record and as an example of using the pair-handling functions from List::Util.
The regex below matches and captures a character, followed by a lookahead for the next seven, which it also captures. This way the engine captures 1 and 7-long substrings as it moves along char by char. Then the consecutive pairs from the returned list are joined
my $str = join '', "a".."k"; # 'Quarterflash';
use List::Util qw(pairmap);
my @eights = pairmap { $a . $b } $str =~ /(. (?=(.{7})) )/gx;
# or
# use List::Util qw(pairs);
# my @eights = map { join '', @$_ } pairs $str =~ /(.(?=(.{7})))/g;

Related

How can I remove all the vowels unless they are in word beginnings?

$text = "I like apples more than oranges\n";
@words = split /" "/, $text;
foreach (@words) [1..] {
if $words "AEIOUaeiou";
$words =~ tr/A E I O U a e i o u//d;
}
print "$words\n";
"I like apples more than oranges" will become "I lk appls mr thn orngs". "I" in "I", "a" in "appls" and "o" in "orngs" will stay because they are the first letter in the word.
This is my research assignment as a first year student. I am allowed to ask questions and later cite them. Please don't be mean.
I know you say you are not allowed to use a regex, but for everyone else that shows up here I'll show the use of proper tools. But, then I'll do something just as useful with tr///.
One of the tricks of programming (and mathematics) is decomposing what look like hard problems into easier problems, especially if you already have solutions for the easy problems. (Read about Parnas decomposition, for example).
So, the question is "How can I remove all the vowels unless they are in word beginnings?" (after I made your title a bit shorter). This led the answers to think about words, so they split up the input, did some work to ensure they weren't working on the first character, and then reassembled the result.
But, another way to frame the problem is "How do I remove all the vowels that come after another letter?". The only letter that doesn't come after another letter is the first letter of a word.
The regex for a vowel that comes after another letter is simple (but I'll stick to ASCII here, although it is just as simple for any Unicode letter):
[a-z][aeiou]
That only matches when there is a vowel right after a letter. Now you want to replace all of those with nothing. Use the substitution operator, s///. The /g flag makes the substitution global (every match is replaced) and the /i flag makes it case insensitive:
s/[a-z][aeiou]//gi;
But, there's a problem. It also replaces that leading letter. That's easy enough to fix. The \K in a substitution says to ignore the part of the pattern before it in the replacement. Anything before the \K is not replaced. So, this only replaces the vowels:
s/[a-z]\K[aeiou]//gi;
But, maybe there are vowels next to each other, so throw in the + quantifier for "one or more" of the preceding item:
s/[a-z]\K[aeiou]+//gi;
You don't need to care about words at all.
Some other ways
Saying that a letter must follow another letter has a special zero-width assertion: the non-word boundary, \B (although that also counts digits and underscore as "letters"):
s/\B[aeiou]+//gi;
The \K was introduced in v5.10 and was really a nifty trick to have a variable-width lookbehind. But, the lookbehind here is fixed width: it's one character:
s/(?<=[a-z])[aeiou]+//gi;
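A quick way to convince yourself that the three forms agree on the sample sentence is to run them side by side. This is only a sketch; the sub names are invented for the comparison.

```perl
use strict;
use warnings;

# Three ways to drop vowels that follow a letter (sub names are illustrative)
sub strip_k  { my ($s) = @_; $s =~ s/[a-z]\K[aeiou]+//gi;    return $s }
sub strip_b  { my ($s) = @_; $s =~ s/\B[aeiou]+//gi;         return $s }
sub strip_lb { my ($s) = @_; $s =~ s/(?<=[a-z])[aeiou]+//gi; return $s }

my $text = "I like apples more than oranges";
print "$_\n" for strip_k($text), strip_b($text), strip_lb($text);
```

All three lines of output read "I lk appls mr thn orngs". They can differ on other input, since \B also treats digits and underscores as word characters.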
But, caring about words
Suppose you need to handle each word separately, for some other requirement. It looks like you've mixed a little Python-ish sort of code, and it would be nice if Perl could do that :). The problem doesn't change that much because you can do the same thing for each individual word.
foreach my $word ( split /\s+/, $x ) {
.... # same thing for each word
}
But, here's an interesting twist: how do you put it all back together? The other solutions just use a single space, assuming that's the separator. Maybe there should be two spaces, or tabs, or whatever. The split has a special "separator retention mode" that can keep whatever was between the pieces. When you have captures in the split pattern, those capture values are part of the output list:
my @words_and_separators = split /(\s+)/, $x;
Since you know that none of the separators will have vowels, you can make substitutions on them knowing they won't change. This means you can treat them just like the words (that is, there is no special case, which is another thing to think about as you decompose problems). To get your final string with the original spacing, join on the empty string:
my $ending_string = join '', @words_and_separators;
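To see separator retention in isolation, here is a small sketch with uneven spacing. The sub name split_keep and the sample string are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace but keep the separators in the list
sub split_keep {
    my ($s) = @_;
    return split /(\s+)/, $s;
}

my $x     = "I  like\tapples";      # two spaces, then a tab
my @plain = split /\s+/, $x;        # separators are thrown away
my @kept  = split_keep($x);         # separators sit between the words

print scalar(@plain), " items without captures\n";
print scalar(@kept),  " items with captures\n";
print "round trip ok\n" if join('', @kept) eq $x;
```

The first split yields 3 items and the second 5, and joining the second list on the empty string gives back the original string exactly.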
So, here's how that might all look put together. I'll add the /r flag on the substitution so it returns the modified copy instead of working on the original (don't modify the control variable!):
my @words;
foreach my $word ( split /(\s+)/, $x ) {
    push @words, $word =~ s/\B[aeiou]+//gir;
}
my $ending_string = join '', @words;
But, that foreach is a bit annoying. This list pipeline is the same, and it's easier to read these bottom to top. Each thing produces a list that flows into the thing above it. This is how I'd probably express it in real code:
my $ending_string =
join '',
map { s/\B[aeiou]+//gir } # each item is in $_
split /(\s+)/, $x;
Now, here's the grand finale. What if we didn't split things up on whitespace but on whitespace and the first letter of each word? With separator retention mode we know that we only have to affect every other item, so we count them as we do the map:
my $n = 0;
my $ending_string =
join '',
map { ++$n % 2 ? tr/aeiouAEIOU//dr : $_ }
split /((?:^|\s+)[a-z])/i, $x;
But, I wouldn't write this technique in this way because someone would ultimately find me and exact their revenge. Instead, that foreach I found annoying before may soothe the angry masses:
my $n = 0;
foreach ( split /((?:^|\s+)[a-z])/i, $x ) {
print ++$n % 2 ? tr/aeiouAEIOU//dr : $_;
}
This now remembers the actual separators from the original string and leaves alone the first character of the "word" because it's not in the element we will modify.
The code in the foreach doesn't need to use the conditional operator, ?: or some of the other features. The important part is skipping every other element. That split pattern is a bit of a puzzler if you haven't seen it before, but that's what you get with those sorts of requirements. I think modifying a portion of the substring is just as likely to trip up people on a first read.
I mean, if they are going to make you do it the wrong way in the homework, strike back with something that will take up a bit of their time. :)
Oh, this is fun
I had another idea, because tr/// has another task beyond transliteration. It also counts. Because it returns the number of replacements, if you replace anything with itself, you get a count of the occurrences of that thing. You can count vowels, for instance:
my $has_vowels = $string =~ tr/aeiou/aeiou/; # counts vowels
But, with a string of one letter, that means you have a way to tell if it is a vowel:
my $is_vowel = substr( $string, $i, 1 ) =~ tr/aeiou/aeiou/;
You also can know things about the previous character:
my $is_letter = substr( $string, $i - 1, 1 ) =~ tr/a-zA-Z/a-zA-Z/;
Put that together and you can look at any position and know if it's a vowel that follows a letter. If so, you skip that letter. Otherwise, you add that letter to the output:
use v5.10;
$x = "I like apples more than oranges oooooranges\n";
my $output = substr $x, 0, 1; # avoid the -1 trap (end of string!)
for( my $i = 1; $i < length $x; $i++ ) {
if( substr( $x, $i, 1 ) =~ tr/aeiou/aeiou/ ) { # is a vowel
next if substr( $x, $i - 1, 1 ) =~ tr/a-zA-Z/a-zA-Z/;
}
$output .= substr $x, $i, 1;
}
say $output;
This has the fun consequence of using the recommended operator but completely bypassing the intent. But, this is a proper and intended use of tr///.
It appears that you need to put a little more effort into learning Perl before taking on challenges like this. Your example contains a lot of code that simply isn't valid Perl.
$x = "I like apples more than oranges\n"; #the original sentence
foreach $i in @x[1..] {
You assign your text to the scalar variable $x, but then try to use the array variable @x. In Perl, these are two completely separate variables that have no connection whatsoever. Also, in Perl, the range operator (..) needs values at both ends.
If you had an array called @x (and you don't, you have a scalar) then you could do what you're trying to do here with foreach $i (@x)
if $i "AEIOUaeiou";
I'm not sure what you're trying to do here. I guess the nearest useful Perl expression I can see would be something like:
if ($i =~ /^[AEIOUaeiou]$/)
Which would test if $i is a vowel. But that's a regex, so you're not allowed to use it.
Obviously, I'd solve this problem with a regex, but as those are banned, I've reached for some slightly more obscure Perl features in my code below (that's so your teacher won't believe this is your solution if you just cut and paste it):
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $text = "I like apples more than oranges\n";
# Split the string into an array of words
my @words = split /\s+/, $text;
# For each word...
for (@words) {
    # Get a substring that omits the first character
    # and use tr/// to remove vowels from that substring
    substr($_, 1) =~ tr/AEIOUaeiou//d;
}
# Join the array back together
$text = join ' ', @words;
say $text;
Update: Oh, and notice that I've used tr/AEIOUaeiou//d where you have tr/A E I O U a e i o u//d. It probably won't make any difference here (depending on your approach - but you'll probably be applying it to strings that don't contain spaces) but it's good practice to only include the characters that you want to remove.
We can go over the input string from the end and remove any vowel that's not preceded by a space. We go from right to left so we don't have to adjust the position after each deletion. We don't need to check the very first letter, as it should never be removed. To remove a vowel, we can use tr///d on the substr of the original string.
for my $i (reverse 1 .. length $x) {
substr($x, $i, 1) =~ tr/aeiouAEIOU//d
if substr($x, $i - 1, 1) ne ' ';
}
First, your if statement is wrong.
Second, this is not Perl code.
Here is a piece of code that will work, though there is a better way to do it:
my $x = "I like apples more than oranges\n";
my $new = "";
my @arr;
foreach my $word (split(' ', $x)) {
    @arr = split('', $word);
    for (my $i = 0; $i < scalar @arr; $i++) {
        if ($i == 0) {
            # Always keep the first character of the word
            $new .= $arr[$i];
        }
        elsif (index("AEIOUaeiou", $arr[$i]) == -1) {
            # Keep any later character that is not a vowel
            $new .= $arr[$i];
        }
    }
    $new .= " ";
}
print "$new\n";
Here I am splitting the string in order to get an array, then I am checking if the given char is a vowel, if it's not, I am appending it to a new string.
Always include
use strict;
use warnings;
on top of your code.
Clearly this is an exercise in lvalues. Obviously. Indubitably!
#!/usr/bin/env perl
# any old perl will do
use 5.010;
use strict;
use warnings;
# This is not idiomatic nor fantastic code. Idiotastic?
$_='I am yclept Azure-Orange, queueing to close a query. How are YOU?';
# My little paws typed "local pos" and got
# "Useless localization of match position" :(
# so a busy $b keeps/restores that value
while (/\b./g) {
substr($_,$b=pos,/\b/g && -$b+pos)
# Suggestion to use tr is poetic, not pragmatic,
# ~ tr is sometimes y and y is sometimes a vowel
=~ y/aeiouAEIOU//d;
pos=$b;
}
say
# "say" is the last word.
Was there an embargo against using s/// substitution, or against using all regex? For some reason I thought matching was OK, just not substitution. If matches are OK, I have an idea that "improves" upon this by removing $b through pattern matching side effects. Will see if it pans out. If not, should be pretty easy to replace /\b/ and pos with index and variables, though the definition of word boundary over-simplifies in that case.
(edit) here it is a little more legible with nary a regex
my $text="YO you are the one! The-only-person- asking about double spaces.
Unfortunate about newlines...";
for (my $end=length $text;
$end > 0 && (my $start = rindex $text,' ',$end);
$end = $start-1) {
# y is a beautiful letter, using it for vowels is poetry.
substr($text,2+$start,$end-$start) =~ y/aeiouUOIEA//d;
}
say $text;
Maybe more devious minds will succeed with vec, unpack, open, fork?
You can learn about some of these techniques via
perldoc -f substr
perldoc -f pos
perldoc re
As for my own implementer notes, the least important thing is ending without punctuation so nothing can go after

Finding index of white space in Perl

I'm trying to find the index of white space in a string in Perl.
For example, if I have the string
stuff/more stuffhere
I'd like to select the word "more" with a substring method. I can find the index of "/" but haven't figured out how to find the index of white space. The length of the substring I'm trying to select will vary, so I can't hard code the index. There will only be one white space in the string (other than those after the end of the string).
Also, if anybody has any better ideas of how to do this, I'd appreciate hearing them. I'm fairly new to programming so I'm open to advice. Thanks.
Just use index:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'stuff/more stuffhere';
my $index_of_slash = index $string, '/';
my $index_of_space = index $string, ' ';
say "Between $index_of_slash and $index_of_space.";
The output is
Between 5 and 10.
Which is correct:
0         1
01234567890123456789
stuff/more stuffhere
If by "whitespace" you also mean tabs or whatever, you can use a regular expression match and the special variables @- and @+:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = "stuff/more\tstuffhere";
if ($string =~ m{/.*(?=\s)}) {
say "Between $-[0] and $+[0]";
}
The (?=\s) means is followed by a whitespace character, but the character itself is not part of the match, so you don't need to do any maths on the returned values.
As you stated, you want to select the word between the first / and the first space following it. If this is the case, you may not need any index at all (you need just the word).
A perfect tool for finding something in a text is a regex. Look at the following code:
$txt = 'stuff/more stuffxx here';
if ($txt =~ /\/(.+?) /) {
print "Match: $1.\n";
}
The regex used tries to match:
a slash,
a non-empty sequence of any chars (note the ?, which makes it the reluctant version), enclosed in a capturing group,
a space.
So after the match, $1 contains what was captured by the first capturing group, i.e. "your" word.
But if for any reason you are interested in the starting and ending offsets of this word, you can read them from $-[1] and $+[1] (starting / ending indices of the first capturing group).
The arrays @- (@LAST_MATCH_START) and @+ (@LAST_MATCH_END) give offsets of the start and end of last successful submatches. See Regex related variables in perlvar.
You can capture your real target and then read the offsets from @+: $+[1] is the offset right after the capture, and $+[0] is the offset where the entire match ends.
@+
This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. $+[0] is the offset into the string of the end of the entire match. This is the same value as what the pos function returns when called on the variable that was matched against.
Example
my $str = 'target and target with spaces';
while ($str =~ /(target)\s/g)
{
say "Position after match: $+[0]"
}
prints
Position after match: 7
Position after match: 18
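These are the offsets just past the matched space, since $+[0] marks the end of the entire match; the space itself is at $+[0] - 1, which here equals $+[1], the end of the captured 'target'.
Or you can capture \s instead and use $-[1], the offset where that capture (the space) starts.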
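Applied to the string from the question, the offsets of the first capture can be read off like this. It is a minimal sketch; the sub name word_offsets is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: offsets of the word that follows the first slash.
# Returns (start of the word, offset just past the word), or nothing on no match.
sub word_offsets {
    my ($s) = @_;
    return unless $s =~ m{/(\S+)};
    return ($-[1], $+[1]);
}

my $string = 'stuff/more stuffhere';
my ($from, $to) = word_offsets($string);
print "Word runs from $from up to $to: ",
      substr($string, $from, $to - $from), "\n";
```

Here $from is 6 and $to is 10, which is also the index of the space, so no extra arithmetic is needed to locate it.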
You can use
my $str = "stuff/more stuffhere";
if ($str =~ m{/\K\S+}) {
... substr($str, $-[0], $+[0] - $-[0]) ...
}
But why substr? That's very weird there. Maybe if you told us what you actually wanted to do, we could provide better alternatives. Here are three cases:
Data extraction:
my $str = "stuff/more stuffhere";
if ( my ($word) = $str =~ m{/(\S+)} ) {
say $word; # more
}
Data replacement:
my $str = "stuff/more stuffhere";
$str =~ s{/\K\S+}{REPLACED};
say $str; # stuff/REPLACED stuffhere
Data replacement (dynamic):
my $str = "stuff/more stuffhere";
$str =~ s{/\K(\S+)}{ uc($1) }e;
say $str; # stuff/MORE stuffhere

Skipping particular positions in a string using substitution operator in perl

Yesterday I got stuck on a Perl script. To simplify: suppose there is a string (say ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD). First I have to break it at every position where "E" occurs, and second, break it only where the user wants. The condition is that the program should not cut at sites where E is followed by P. For example, there are 6 Es in this sequence, so one should get 7 fragments, but as 2 Es are followed by P, one will get only 5 fragments in the output.
I need help with the second case. Suppose the user doesn't want to cut this sequence at, say, the 5th and 10th positions of E in the sequence; what should the script be to make the program skip only these two sites? My script for the first case is:
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
$otext=~ s/([E])/$1=/g; #Main cut rule.
$otext=~ s/=P/P/g;
@output = split( /\=/, $otext);
print "@output";
Please do help!
To split on "E" except where it's followed by "P", you should use Negative look-ahead assertions.
From perldoc perlre "Look-Around Assertions" section:
(?!pattern)
A zero-width negative look-ahead assertion.
For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
# E E EP E EP E
my @output=split(/E(?!P)/, $otext);
use Data::Dumper; print Data::Dumper->Dump([\@output]);
$VAR1 = [
'ABCD',
'ABCD',
'ABCDEPABCD',
'ABCDEPABCD',
'ABCD'
];
Now, in order to NOT cut at occurrences #2 and #4, you can do 2 things:
Concoct a really fancy regex that automatically fails to match on given occurence. I will leave that to someone else to attempt in an answer for completeness sake.
Simply stitch together the correct fragments.
I'm too brain-dead to come up with a good idiomatic way of doing it, but the simple and dirty way is either:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut at cut sites 2,4
my @output_final;
for (my $i=0; $i < @output; $i++) {
    if ($no_cuts{$i}) {
        # Glue onto the previous fragment, restoring the E that split removed
        $output_final[-1] .= 'E' . $output[$i];
    } else {
        push @output_final, $output[$i];
    }
}
print Data::Dumper->Dump([\@output_final]);
$VAR1 = [
'ABCD',
'ABCDEABCDEPABCD',
'ABCDEPABCDEABCD'
];
Or, simpler:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut at cut sites 2,4
for (my $i=1; $i < @output; $i++) {
    next unless $no_cuts{$i};
    # Glue onto the previous fragment, restoring the E that split removed
    $output[$i-1] .= 'E' . $output[$i];
    $output[$i] = undef; # Make the slot empty
}
my @output_final = grep { defined } @output; # Skip empty slots
print Data::Dumper->Dump([\@output_final]);
$VAR1 = [
'ABCD',
'ABCDEABCDEPABCD',
'ABCDEPABCDEABCD'
];
Here's a dirty trick that exploits two facts:
normal text strings never contain null bytes (if you don't know what a null byte is, you should as a programmer: http://en.wikipedia.org/wiki/Null_character, and nb. it is not the same thing as the number 0 or the character 0).
perl strings can contain null bytes if you put them there, but be careful, as this may screw up some perl internal functions.
The "be careful" is just a point to be aware of. Anyway, the idea is to substitute a null byte at the point where you don't want breaks:
my $s = "ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD";
my @nobreak = (4,9);
foreach (@nobreak) {
substr($s, $_, 1) = "\0";
}
"\0" is an escape sequence representing a null byte like "\t" is a tab. Again: it is not the character 0. I used 4 and 9 because there were E's in those positions. If you print the string now it looks like:
ABCDABCDABCDEPABCDEABCDEPABCDEABCD
Because null bytes don't display, but they are there, and we are going to swap them back out later. First the split:
my @a = split(/E(?!P)/, $s);
Then swap the zero bytes back:
$_ =~ s/\0/E/g foreach (@a);
If you print @a now, you get:
ABCDEABCDEABCDEPABCD
ABCDEPABCD
ABCD
Which is exactly what you want. Note that split removes the delimiter (in this case, the E); if you intended to keep those you can tack them back on again afterward. If the delimiter is from a more dynamic regex it is slightly more complicated, see here:
http://perlmeme.org/howtos/perlfunc/split_function.html
"Example 9. Keeping the delimiter"
If there is some possibility that the @nobreak positions are not E's, then you must also keep track of those when you swap them out to make sure you replace with the correct character again.
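Another option, sketched here under a couple of assumptions, is to walk the matches with a /g loop and count the cut sites, skipping the ones the user listed. The sub name cut_except is hypothetical, the site numbers are taken to be 1-based counts of the E's that are not followed by P, and the E stays at the end of its fragment as in the asker's own script.

```perl
use strict;
use warnings;

# Hypothetical helper: cut after every E not followed by P, except at the
# cut sites whose 1-based numbers are listed in @skip.
sub cut_except {
    my ($seq, @skip) = @_;
    my %skip = map { $_ => 1 } @skip;
    my ($n, $start, @frags) = (0, 0);
    while ($seq =~ /E(?!P)/g) {
        next if $skip{ ++$n };                 # leave this site uncut
        push @frags, substr($seq, $start, pos($seq) - $start);
        $start = pos($seq);                    # next fragment starts after the E
    }
    push @frags, substr($seq, $start) if $start < length($seq);
    return @frags;
}

print "$_\n" for cut_except('ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD', 2, 4);
```

With sites 2 and 4 skipped this prints three fragments: ABCDE, ABCDEABCDEPABCDE and ABCDEPABCDEABCD. No placeholder character is needed, so the input may contain any bytes.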

How do I count the "real" words in a text with Perl?

I've run into a text processing problem. I've an article, and I'd like to find out how many "real" words there are.
Here is what I mean by "real". Articles usually contain various punctuation marks such as dashes, and commas, dots, etc. What I'd like to find out is how many words there are, skipping like "-" dashes and "," commas with spaces, etc.
I tried doing this:
my @words = split ' ', $article;
print scalar @words, "\n";
But that includes various punctuations that have spaces in them as words.
So I'm thinking of using this:
my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";
This would match all words that have characters or numbers in them. What do you think, would this be good enough way to count words in an article?
Does anyone know maybe of a module on CPAN that does this?
Try to use: \W - any non-word character, and also drop _
Solution
use strict;
my $article = 'abdc, dd_ff, 11i-11, ff44';
# case David's, but it didn't work with I'm or There's
$article =~ s/\'//g;
my $number_words = scalar (split /[\W_]+/, $article);
print $number_words;
I think your solution is about as good as you're going to get without resorting to something elaborate.
You could also write it as
my @words = $article =~ /\S*\w\S*/g;
or count the words in a file by writing
my $n = 0;
while (<>) {
my @words = /\S*\w\S*/g;
$n += @words;
}
say "$n words found";
Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.
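For trying this out, the two counting approaches from this thread can be wrapped in small subs and run on the same sample. The sub names and the sample sentence are invented for the comparison.

```perl
use strict;
use warnings;

# Count whitespace-separated tokens that contain a letter or digit
sub count_by_grep {
    my ($text) = @_;
    return scalar grep { /[a-z0-9]/i } split ' ', $text;
}

# Count runs of non-space characters that contain a word character
sub count_by_match {
    my ($text) = @_;
    my $count = () = $text =~ /\S*\w\S*/g;
    return $count;
}

my $sample = "Hello - world , this is a test -- ok";
print count_by_grep($sample), " ", count_by_match($sample), "\n";
```

Both report 7 for this sample. They disagree on a lone underscore, because \w matches _ while [a-z0-9] does not.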

Modify Perl script to output new line every 20 words

I'm using an open-source perl script to create a text corpus based on the English language wikipedia dump. The plain text has been extracted, but various punctuation marks and the like still need to be removed. However, the output of this script essentially creates a 7.2GiB text file containing a single line. Due to my needs, I want to alter the script such that it inserts a new line character every 20 words.
So far, I've tried this:
$wordCount=0;
while (<STDIN>) {
$wordCount++;
//text processing regex commands here
# Remove ellipses
s/\.\.\./ /g;
# Remove dashes surrounded by spaces (e.g. phrase - phrase)
s/\s-+\s/ /g;
# Remove dashes between words with no spaces (e.g. word--word)
s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
# Remove dash at a word end (e.g. three- to five-year)
s/(\w)-\s/$1 /g;
# Remove some punctuation
s/([\"\�,;:%�?�!()\[\]{}<>_\.])/ /g;
# Remove trailing space
s/ $//;
# Remove double single-quotes
s/'' / /g;
s/ ''/ /g;
# Replace accented e with normal e for consistency with the CMU pronunciation dictionary
s/é/e/g;
# Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
s/\s'([\w\s]+[\w])'\s/ $1 /g;
# Remove double spaces
s/  / /g;
chomp($_);
if ($wordCount == 20){
print uc($_) . "\n";
$wordCount=0;
}
print uc($_) . " ";
}
print "\n";
However, this doesn't seem to work, as the raw output has only newlines scattered around arbitrarily. I'd like to have the text formatted so it will fit on a typical 1200px wide monitor without word wrapping.
A sample input text from the file is
The Concise Oxford Dictionary of Politics. Proponents of anarchism
(known as "anarchists") advocate stateless societies as the only moral
form of social organization. There are many types and traditions of
anarchism, not all of which are mutually exclusive. Anarchism as a
social movement has regularly endured fluctuations in popularity. The
term anarchism derives from the Greek ἄναρχος, anarchos, meaning
"without rulers", its use as a synonym is still common outside the
United States. The earliest anarchist themes can be found in the 6th
century BC, among the works of Taoist philosopher Laozi, and in later
centuries by Zhuangzi and Bao Jingyan. The term "anarchist" first
entered the English language in 1642, during the English Civil War, as
a term of abuse, used by Royalists against their Roundhead opponents.
By the time of the French Revolution some, such as the Enragés, began
to use the term positively, in opposition to Jacobin centralisation
of power, seeing "revolutionary government" as oxymoronic. By the
turn of the 19th century, the English word "anarchism" had lost its
initial negative connotation. Modern anarchism sprang from the secular
or religious thought of the Enlightenment, particularly Jean-Jacques
Rousseau's arguments for the moral centrality of freedom. Anarchism",
Encarta Online Encyclopedia 2006 (UK version). From this climate
William Godwin developed what many consider the first expression of
modern anarchist thought. Godwin was, according to Peter Kropotkin,
"the first to formulate the political and economical conceptions of
anarchism, even though he did not give that name to the idea s
developed in his work", while Godwin attached his anarchist ideas to
an early Edmund Burke. The anarcho-communist Joseph Déjacque was the
first person to describe himself as "libertarian". Unlike Proudhon, he
argued that, "it is not the product of his or her labor that the
worker has a right to, but to the satisfaction of his or her needs,
whatever may be t heir nature. Jesus is sometimes considered the first
anarchist in the Christian anarchist tradition. Georges Lechartier
wrote that "The true founder of anarchy was Jesus Christ and . In
Europe, harsh reaction followed the revolutions of 1848, during which
ten countries had experienced brief or long-term social upheaval as
groups carried out nationalis t uprisings. After most of these
attempts at systematic change ended in failure, conservative elements
took advantage of the divided groups of socialists, anarchists,
liberals, and na tionalists, to prevent further revolt. Blanquists,
Philadelphes, English trade unionists, socialists and social
democrats. Due to its links to active workers' movements, the
International became a significant organization. Karl Marx became a
leading figure in the International and a member of its General
Council. Proudhon's followers, the mutualists, opposed Marx's state
socialism, advocating political abstentionism and small property
holdings. In 1868, following their unsuccessful participation in the
League of Peace and Freedom (LPF), Russian revolutionary Mikhail
Bakunin and his collectivist anarchist associa tes joined the First
International (which had decided not to get involved with the LPF). At
first, the collectivists worked with the Marxists to push the First
International in a more revolutionary socialist direction.
Subsequently, the International became polarised into two camps, with
Marx and Bakunin as their respective figureheads. In 1872, the
conflict climaxed with a final split between the two groups at the
Hague Congress, where Bakunin and James Guillaume were expelled from
the International and its headquarters were transferred to New York.
In response, the federalist sections formed their own International at
the St. Imier Congress, adopting a revolutionary anarchist program.
Black Rose Books 2005) ISBN 1-55164-251-4.
There's 7-something gigs worth of text in the file. So using a list or other data structure might be a bit of overkill for these requirements.
What is needed in order to fit my requirements?
Consider using something like Text::Wrap or Text::Autoformat.
open my $in,  '<', $inFileName  or die "Cannot read $inFileName: $!";
open my $out, '>', $outFileName or die "Cannot write $outFileName: $!";
my $wordCount = 0;
while (defined( my $line = <$in> )) {
    chomp $line; # remove newline character
    # split the line into words (could use /\W+/ instead of ' ')
    my @words = split ' ', $line;
    foreach my $word (@words) {
        $wordCount++;
        if ($wordCount == 20) {
            # the 20th word ends the output line
            print $out uc($word), "\n";
            $wordCount = 0;
        }
        else {
            print $out uc($word), " ";
        }
    } # end of foreach word in line
} # end of file while loop
close $in;
close $out;
First, set perl's input record separator to something frequent and useful, like a space:
$/ = ' ';
then loop over the input word by word:
while (<>) {
trim the word:
s/^\s+|\s+$//g;
skip it if it was all space:
$_ or next;
do any other transforms you need
and then add it to a stack, splitting any internal tabs or other space-like characters:
push @words, split /\s+/;
next, check to see if you have 20 words, and if so, print them:
print join(' ' => splice @words, 0, 20), "\n" while @words >= 20;
}
then print anything remaining:
print "@words\n" if @words;
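Put together, the steps above might look like the following sketch. The sub name rewrap and its parameters are made up for illustration, and the demo reads from an in-memory filehandle instead of the real 7.2GiB file.

```perl
use strict;
use warnings;

# Hypothetical helper: copy words from $in to $out, $per_line words per line.
sub rewrap {
    my ($in, $out, $per_line) = @_;
    local $/ = ' ';                  # read one space-delimited chunk at a time
    my @words;
    while (my $chunk = <$in>) {
        # A chunk may still hold tabs or newlines; drop empty pieces
        push @words, grep { length } split /\s+/, $chunk;
        print {$out} join(' ', splice @words, 0, $per_line), "\n"
            while @words >= $per_line;
    }
    print {$out} "@words\n" if @words;   # whatever is left over
}

my $data = "a b c\td e\nf g";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
rewrap($in, \*STDOUT, 3);
```

With three words per line the demo prints "a b c", "d e f" and "g". Because only one chunk and at most a line's worth of words are held at a time, memory use stays small even when the input is a single huge line.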
Without knowing more details about this problem, I'd suggest a brute force solution:
slurp the entire entry,
split to an array based on " ",
foreach the array and print "\n" after every 20 elements.
True to Perl, there are various ways to solve this, but one (perverse?!) way to do it is to read the file byte by byte instead of line by line, or slurping the whole thing in. It's rather brute force-ish but it works. Essentially you are trading memory use for disk usage.
#!/usr/bin/perl -w
use strict;
open(IN, "in.txt") or die;
my $rc = 1;
my $wc = 0;
my $new;
while ($rc != 0)
{
# Read a byte - not safe for Unicode or other double-byte environments!
$rc = read IN, $new, 1, 0;
# We're only interested if the byte isn't punctuation (POSIX character class).
if ($new !~ m/[[:punct:]]/)
{
# word boundary?
if ($new =~ m/ /)
{
$wc++;
if ($wc % 20 == 0)
{
print "\n"; # 20th word, time for a new line.
}
}
print $new;
}
# move on to the next byte
seek IN, 0, 1;
}
close(IN);