I have a program that reads two files: the first file contains terms (single- or multi-word) separated by semicolons ("; "), and the second file contains text. The goal is to determine the offset of each term within the text.
My program starts off well with "fluctuating vacuum" (correct offset 2 20 extracted, and likewise 45 59 for "quantum fields"), but when extracting the offset of, for example, the term "nuclear physics" (correct offset 396 411) my code generates 399 414, and for "Fermionic fields" my code generates 138 154 when the correct answer is 135 151.
The code used is:
#!/usr/bin/perl
use strict;
use warnings;
my @a = ();
my @b = ();
my @aa = ();
my $l=0;
my $v=1;
my $g=0;
my $kh;
my $ligne2;
my $texte;
open(FICC, $ARGV[0]);
print "choose the name of the file\n";
my $fic = <STDIN>;
open(FIC1, ">$fic");
while (<FICC>) {
my $ligne2=$_;
$a[$l]=$ligne2;
$l++;
}
my $aa;
my $ligne;
my $rep = "C:\\scienceie2017_train\\train2";
opendir(REP,$rep) or die "E/S : $!\n";
foreach my $kh (@a) {
chomp($kh);
if ($kh=~/.*\.txt/) {
$texte=$kh;
#print "$kh";
print FIC1 "$texte";
}
@aa=split(/;/,$kh);
#$u++;
while(defined(my $fic=readdir REP)){
my $f="${rep}\\$texte";
open FIC, "$f" or warn "$f E/S: $!\n";
while(<FIC>){
$ligne = $_;
chomp($ligne);
#print FIC1 "#aa";
foreach my $che (#aa) {
$che =~ s/^\s+//;
$che =~ s/\s+$//;
if ($ligne =~/\Q$che\E/) {
print FIC1 "T$v\tTask $-[0] $+[0]\t$che\n";
$v++;
}
}
$v = 1;
}
print FIC1 "\n";
close FIC;
goto che
}
che:
}
The text is:
A fluctuating vacuum is a general feature of quantum fields, of which the free Maxwell field considered in [1–12] is but one example. Fermionic fields such as that describing the electron, also undergo vacuum fluctuations, consequently one expects to find Casimir effects associated with such fields whenever they are confined in some way. Such effects were first investigated in the context of nuclear physics, within the so-called “MIT bag model” of the nucleon [13]. In the bag-model one envisages the nucleon as a collection of fermionic fields describing confined quarks. These quarks are subject to a boundary condition at the surface of the ‘bag’ that represents the nucleon’s surface. Just as in the electromagnetic case, the bag boundary condition modifies the vacuum fluctuations of the field, which results in the appearance of a Casimir force [14–18]. This force, although very weak at a macroscopic scale, can be significant on the small length scales encountered in nuclear physics. It therefore has important consequences for the physics of the bag-model nucleon [19].
The extracted terms are:
fluctuating vacuum;general feature;quantum fields;free Maxwell;free Maxwell field;Maxwell;Maxwell field;Maxwell field;Maxwell field;field considered in ;considered in ;1–12;Fermionic fields;vacuum fluctuations;Casimir;Casimir effects;Casimir effects;Casimir effects;such fields;Such effects;nuclear physics;so-called “MIT;so-called “MIT bag;“MIT bag;“MIT bag model”;bag model”;fermionic fields;fermionic fields describing;boundary condition;nucleon’s surface;electromagnetic case;bag boundary;bag boundary condition;boundary condition;vacuum fluctuations;Casimir;Casimir force ;force ;14–18;macroscopic scale;small length;small length scales;length scales;nuclear physics;important consequences;bag-model nucleon ;
You must open your files in UTF-8; otherwise multi-byte characters (such as the en dash in "[1–12]" and the curly quotes) each count as several characters when the file is read as raw bytes, which shifts the reported offsets.
Replace
open FIC, "$f" or warn "$f E/S: $!\n";
with
open FIC, "<:encoding(UTF-8)", "$f" or warn "$f E/S: $!\n";
I'm not clear on your code, but when I run your provided data with my code, I get these results.
The two arrays @- and @+ (used here as $-[0] and $+[0]) are described in perlvar under "Variables related to regular expressions" (LAST_MATCH_START and LAST_MATCH_END).
My code:
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'A fluctuating vacuum is a general feature ... (rest of line)';
my @terms = split /;/, 'fluctuating vacuum;Fermionic fields;nuclear physics;bag-model nucleon';
for my $term (@terms) {
while ($s =~ /$term/g) {
print "$-[0] - $+[0] $term\n";
}
}
Output:
2 - 20 fluctuating vacuum
135 - 151 Fermionic fields
396 - 411 nuclear physics
983 - 998 nuclear physics
1063 - 1080 bag-model nucleon
#!/usr/bin/perl
$string = "A fluctuating vacuum is a general feature of quantum fields, of which the free Maxwell field considered in [1–12] is but one example. Fermionic fields such as that describing the electron, also undergo vacuum fluctuations, consequently one expects to find Casimir effects associated with such fields whenever they are confined in some way. Such effects were first investigated in the context of nuclear physics, within the so-called “MIT bag model” of the nucleon [13]. In the bag-model one envisages the nucleon as a collection of fermionic fields describing confined quarks. These quarks are subject to a boundary condition at the surface of the ‘bag’ that represents the nucleon’s surface. Just as in the electromagnetic case, the bag boundary condition modifies the vacuum fluctuations of the field, which results in the appearance of a Casimir force [14–18]. This force, although very weak at a macroscopic scale, can be significant on the small length scales encountered in nuclear physics. It therefore has important consequences for the physics of the bag-model nucleon [19].";
@extracted_terms = ( "fluctuating vacuum", "Fermionic fields", "nuclear physics", "bag-model nucleon" );
for my $term ( @extracted_terms )
{
$position = index $string, $term;
printf ( "%s, %s\n", $position, $position + length($term) );
}
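Note that index as used here reports only the first occurrence of each term ("nuclear physics" appears twice in the text). A hedged variant, using the three-argument form of index as a drop-in replacement for the loop body, finds them all:

my $position = -1;
while ( ( $position = index $string, $term, $position + 1 ) != -1 ) {
    printf( "%s, %s\n", $position, $position + length($term) );
}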
I'm using Perl to feed data to an LCD display. The display is 8 characters wide. The strings of data to be displayed are always significantly longer than 8 characters. As such, I need to break the strings down into "frames" of 8 characters or less, and feed the "frames" to the display one at a time.
The display is not intelligent enough to do this on its own. The only convenience it offers is that strings of less than 8 characters are automatically centered on the display.
In the beginning, I simply sent the string 8 characters at a time - here goes 1-8, now 9-16, now 17-24, etc. But that wasn't especially nice-looking. I'd like to do something better, but I'm not sure how best to approach it.
These are the constraints I'd like to implement:
Fit as many words into a "frame" as possible
No starting/trailing space(s) in a "frame"
A symbol (i.e. hyphen, ampersand, etc.) with a space on both sides qualifies as a word
If a word is longer than 8 characters, simulate per-character scrolling
Break words longer than 8 characters at a slash or hyphen
Some hypothetical input strings, and desired output for each...
Electric Light Orchestra - Sweet Talkin' Woman
Electric
Light
Orchestr
rchestra
- Sweet
Talkin'
Woman
Quarterflash - Harden My Heart
Quarterf
uarterfl
arterfla
rterflas
terflash
- Harden
My Heart
Steve Miller Band - Fly Like An Eagle
Steve
Miller
Band -
Fly Like
An Eagle
Hall & Oates - Did It In A Minute
Hall &
Oates -
Did It
In A
Minute
Bachman-Turner Overdrive - You Ain't Seen Nothing Yet
Bachman-
Turner
Overdriv
verdrive
- You
Ain't
Seen
Nothing
Yet
Being a relative Perl newbie, I'm trying to picture how best to handle this. Certainly I could split the string into an array of individual words. From there, perhaps I could loop through the array, counting the letters in each subsequent word to build the 8-character "frames". Upon encountering a word longer than 8 characters, I could then repetitively call substr on that word (with offset +1 each time), creating the illusion of scrolling.
Is this a reasonable way to accomplish my goal? Or am I reinventing the wheel here? How would you do it?
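As a hedged illustration of that substr-based scrolling idea (mine, not from the original post): repeated substr calls with the offset advanced by one each time produce the 8-character frames of a long word.

my $word   = 'Quarterflash';
my @frames = map { substr $word, $_, 8 } 0 .. length($word) - 8;
print "$_\n" for @frames;   # Quarterf, uarterfl, arterfla, rterflas, terflash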
The base question is to find all consecutive overlapping N-long substrings in a compact way.
Here it is in one pass with a regex, and see the end for doing it using substr.
my $str = join '', "a".."k"; # 'Quarterflash';
my @eights = $str =~ /(?=(.{8}))/g;
This uses a lookahead which also captures, and in this way the regex crawls up the string character by character, capturing the "next" eight each time.
Once we are at it, here is also a basic solution for the problem. Add words to a buffer until it would exceed 8 characters, at which point it is added to an array of display-ready strings and cleared.
use warnings;
use strict;
use feature 'say';
my $str = shift // "Quarterflash - Harden My Heart";
my @words = split ' ', $str;
my @to_display;
my $buf = '';
foreach my $w (@words) {
if (length $w > 8) {
# Now have to process the buffer first then deal with this long word
push @to_display, $buf;
$buf = '';
push @to_display, $w =~ /(?=(.{8}))/g;
}
elsif ( length($buf) + 1 + length($w) > 8 ) {
push @to_display, $buf;
$buf = $w;
}
elsif (length $buf != 0) { $buf .= ' ' . $w }
else { $buf = $w }
}
push @to_display, $buf if $buf;
say for @to_display;
This is clearly missing some special/edge cases, in particular those involving non-word characters and hyphenated words, but that shouldn't be too difficult to add.†
Here is a way to get all consecutive 8-long substrings using substr
my @to_display = map { substr $str, $_, 8 } 0..length($str)-8;
† Example, break a word with hyphen/slash when it has no spaces around it (per question)
my @parts = split m{\s+|(?<=\S)[-/](?=\S)}, $w;
The hyphen/slash is discarded as this stands; that can be changed by capturing the pattern as well and then filtering out elements with only spaces
my @parts = grep { /\S/ } split m{( \s+ | (?<=\S) [-/] (?=\S) )}x, $w;
These haven't been tested beyond just barely. Can fit in the if (length $w > 8) branch.
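For instance, one hedged way (my own arrangement, only lightly tested) to slot that split into the long-word branch of the loop above:

if ( length $w > 8 ) {
    push @to_display, $buf if length $buf;
    $buf = '';
    for my $part ( split m{\s+|(?<=\S)[-/](?=\S)}, $w ) {
        if   ( length $part > 8 ) { push @to_display, $part =~ /(?=(.{8}))/g }
        else                      { push @to_display, $part }
    }
}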
The initial take-- The regex was originally written with a two-part pattern. Keeping it here for record and as an example of use of pair-handling functions from List::Util
The regex below matches and captures a character, followed by a lookahead for the next seven, which it also captures. This way the engine captures 1 and 7-long substrings as it moves along char by char. Then the consecutive pairs from the returned list are joined
my $str = join '', "a".."k"; # 'Quarterflash';
use List::Util qw(pairmap);
my @eights = pairmap { $a . $b } $str =~ /(. (?=(.{7})) )/gx;
# or
# use List::Util qw(pairs);
# my @eights = map { join '', @$_ } pairs $str =~ /(.(?=(.{7})))/g;
I have this one-line Unix shell script
for i in 1 2 3 4; do sed "$(tr -dc '0-9' < /dev/urandom | fold -w 5 |
awk '$0>=35&&$0<=65570' | head -1)q;d" "$0"; done | perl -p00e
's/\n(?!\Z)/ /g'
The script has 65K words in it, one per line, from line 35 to 65570. The code and the data are in the same file.
This script outputs 4 space-separated random words from this list with a newline at the end. For example
first fourth third second
How can I make this one-liner much shorter with Perl, keeping the
tr -dc '0-9' < /dev/urandom
part?
Keeping it is important since it provides Cryptographically Secure Pseudo-Random Numbers (CSPRNs) for all Unix OSs. Of course, if Perl can get numbers from /dev/urandom then the tr can be replaced with Perl too, but the numbers from urandom need to stay.
For convenience, I shared the base script with 65K words: 65kwords.txt
Please use only core modules. It would be used for generating "human-memorable passwords".
Later, the (hashing) iteration count used when storing these passwords would be extremely high, so brute-force attacks would be very slow, even with many GPUs/FPGAs.
You mention needing a CSPRN, which makes this a non-trivial exercise - if you need cryptographic randomness, then using built-in stuff (like rand) is not a good choice, as the implementation is highly variable across platforms.
But you've got Rand::Urandom which looks like it does the trick:
By default it uses the getentropy() (only available in > Linux 3.17) and falls back to /dev/arandom then /dev/urandom.
#!/usr/bin/env perl
use strict;
use warnings;
use Rand::Urandom;
chomp ( my @words = <DATA> );
print $words[rand @words], " " for 1..4;
print "\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
water
watch
walk
wake
voice
Failing that though - you can just read bytes from /dev/urandom directly:
#!/usr/bin/env perl
use strict;
use warnings;
my $number_of_words = 4;
chomp ( my @words = <DATA> );
open ( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
my $bytes;
read ( $urandom, $bytes, 2 * $number_of_words ); #2 bytes 0 - 65535
#for testing
#unpack 'n' is n An unsigned short (16-bit)
# unpack 'n*' in a list context returns a list of these.
foreach my $value ( unpack ( "n*", $bytes ) ) {
print $value,"\n";
}
#actually print the words.
#note - this assumes that you have the right number in your list.
# you could add a % @words to the map, e.g. $words[$_ % @words]
#but that will mean wrapping occurs, and will alter the frequency distribution.
#a more robust solution would be to fetch additional bytes if the 'slot' is
#empty.
print join " ", ( map { $words[$_] } unpack ( "n*", $bytes )),"\n";
__DATA__
yarn
yard
wound
worst
#etc.
Note - the above relies on the fact that your wordlist has exactly as many entries as two bytes can index (65536) - if this assumption isn't true, you'll need to deal with 'missed' words. A crude approach would be to take a modulo, but that would mean some wrapping and therefore not quite truly even distribution of word picks. Otherwise you can bit-mask and reroll, as indicated below:
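For the record, here is a hedged sketch (mine, not part of the answer) of that bitmask-and-reroll idea applied directly to a wordlist of arbitrary size, reusing $urandom, @words and $number_of_words from the snippet above; it assumes the list has no more than 65536 entries.

my $mask = 1;
$mask <<= 1 while $mask < @words;   # next power of two at or above the list size
$mask -= 1;
my @picks;
while ( @picks < $number_of_words ) {
    read( $urandom, my $pair, 2 ) == 2 or die $!;
    my $index = unpack( 'n', $pair ) & $mask;          # 0 .. $mask
    push @picks, $words[$index] if $index < @words;    # reroll anything too high
}
print join( ' ', @picks ), "\n";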
On a related point though - have you considered not using a wordlist, and instead using consonant-vowel-consonant groupings?
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
#uses /dev/urandom to fetch bytes.
#generates consonant-vowel-consonant groupings.
#each is 11.22 bits of entropy, meaning a 4-group is 45 bits.
#( 20 * 6 * 20 = 2400, which is 11.22 bits of entropy: log2 2400 )
#log2(2400 ^ 4) = 44.91
#but because it's generated 'true random' it's a known-entropy string.
my $num = 4;
my $format = "CVC";
my %letters = (
V => [qw ( a e i o u y )],
C => [ grep { not /[aeiouy]/ } "a" .. "z" ], );
my %bitmask_for;
foreach my $type ( keys %letters ) {
#find the next power of 2 for the number of 'letters' in the set.
#So - for the '20' letter group, that's 31. (0x1F)
#And for the 6 letter group that's 7. (0x07)
$bitmask_for{$type} = ( 2 << log ( @{$letters{$type}} ) / log 2 ) - 1 ;
}
open( my $urandom, '<:raw', '/dev/urandom' ) or die $!;
for ( 1 .. $num ) {
for my $type ( split //, $format ) {
my $value;
while ( not defined $value or $value >= @{ $letters{$type} } ) {
my $byte;
read( $urandom, $byte, 1 );
#byte is 0-255. Our key space is 20 or 6.
#So rather than modulo, which would lead to an uneven distribution,
#we just bitmask and discard any 'too high' values.
$value = (unpack "C", $byte ) & $bitmask_for{$type};
}
print $letters{$type}[$value];
}
print " ";
}
print "\n";
close($urandom);
This generates 3 character CVC symbols, with a known entropy level (11.22 per 'group') for making reasonably robust passwords. (45 bits as opposed to the 64 bits of your original, although obviously you can add extra 'groups' to gain 11.22 bits per time).
This answer is not cryptographically safe!
I would do this completely in Perl. No need for a one-liner. Just grab your word-list and put it into a Perl program.
use strict;
use warnings;
my @words = qw(
first
second
third
fourth
);
print join( q{ }, map { $words[int rand @words] } 1 .. 4 ), "\n";
This grabs four random words from the list and outputs them.
rand @words evaluates @words in scalar context, which gives the number of elements, and creates a random floating point value between 0 and smaller than that number. int cuts off the decimals. This is used as the index to grab an element out of @words. We repeat this four times with the map statement, where the 1 .. 4 is the same as passing a list of (1, 2, 3, 4) into map as an argument. This argument is ignored; instead our random word is picked. map returns a list, which we join on one space. Finally we print the resulting string, and a newline.
The word list is created with the quoted words qw() operator, which returns a list of quoted words. It's shorthand so you don't need to type all the quotes ' and commas ,.
If you'd want to have the word list at the bottom you could either put the qw() in a sub and call it at the top, or use a __DATA__ section and read from it like a filehandle.
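For example, a hedged sketch of the __DATA__ variant of the same program (the four words are just placeholders):

use strict;
use warnings;
chomp( my @words = <DATA> );
print join( q{ }, map { $words[ int rand @words ] } 1 .. 4 ), "\n";
__DATA__
first
second
third
fourth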
The particular method using tr and fold on /dev/urandom is a lot less efficient than it could be, so let's fix it up a little bit, while keeping the /dev/urandom part.
Assuming that available memory is enough to contain your script (including wordlist):
chomp(@words = <DATA>);
open urandom, "/dev/urandom" or die;
read urandom, $randbytes, 4 * 2 or die;
print join(" ", map $words[$_], unpack "S*", $randbytes), "\n";
__DATA__
word
list
goes
here
This goes for brevity and simplicity without outright obfuscation — of course you could make it shorter by removing whitespace and such, but there's no reason to. It's self-contained and will work with several decades of perls (yes, those bareword filehandles are deliberate :-P)
It still expects exactly 65536 entries in the wordlist, because that way we don't have to worry about introducing bias to the random number choice using a modulus operator. A slightly more ambitious approach might be to read 48 bytes from urandom for each word, turning it into a floating-point value between 0 and 1 (portable to most systems) and multiplying it by the size of the word list, allowing for a word list of any reasonable size.
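A hedged sketch of that floating-point idea (my own; it assumes a 64-bit perl and reads 8 bytes per word rather than the figure mentioned above): the top 53 bits of each read become an exactly representable float in [0,1), which is then scaled by the list size, so the list no longer needs exactly 65536 entries.

use strict;
use warnings;
my @words = qw( first second third fourth );   # placeholder list of any size
open my $urandom, '<:raw', '/dev/urandom' or die $!;
my @picks;
for ( 1 .. 4 ) {
    read( $urandom, my $bytes, 8 ) == 8 or die $!;
    my $float = ( unpack( 'Q', $bytes ) >> 11 ) / 2**53;   # exact double, 0 <= $float < 1
    push @picks, $words[ int( $float * @words ) ];
}
print "@picks\n";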
A lot of nonsense is talked about password strength, and I think you're overestimating the worth of several of your requirements here
I don't understand your preoccupation with making your code "much shorter with perl". (Why did you pick Perl?) Savings here can only really be useful to make the script quicker to read and compile, but they will be dwarfed by the half megabyte of data following the code which must also be read
In this context, the usefulness to a hacker of a poor random number generator depends on prior knowledge of the construction of the password together with the passwords that have been most recently generated. With a sample of only 65,000 words, even the worst random number generator will show insignificant correlation between successive passwords
In general, a password is more secure if it is longer, regardless of its contents. Forming a long password out of a sequence of English words is purely a way of making the sequence more memorable
"Of course later, the (hashing) iteration count ... would be extreme high, so brute-force [hacking?] would be very slow"
This doesn't follow at all. Cracking algorithms won't try to guess the four words you've chosen: they will see only a thirty-character (or so) string consisting only of lower-case letters and spaces, and whose origin is insignificant. It will be no more or less crackable than any other password of the same length with the same character set
I suggest that you should rethink your requirements and so make things easier for yourself. I don't find it hard to think of four English words, and don't need a program to do it for me. Hint: pilchard is a good one: they never guess that!
If you still insist, then I would write something like this in Perl. I've used only the first 18 lines of your data.
use strict;
use warnings 'all';
use List::Util 'shuffle';
my @s = map /\S+/g, ( shuffle( <DATA> ) )[ 0 .. 3 ];
print "@s\n";
__DATA__
yarn
yard
wound
worst
worry
work
word
wool
wolf
wish
wise
wipe
winter
wing
wind
wife
whole
wheat
output
wind wise winter yarn
You could use Data::Random::rand_words()
perl -MData::Random -E 'say join $/, Data::Random::rand_words(size => 4)'
I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.
To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb
Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:
abcde => 1
Next line, it moves one character to the right and does the same:
bcdef => 1
It then continues for the rest of the string and prints the 5 most frequent sequences:
open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'
my $seq = <$in>; # read whole file into string
my $len = length($seq);
my $seq_length = 5; # set k-mer length
my %data;
for (my $i = 0; $i <= $len - $seq_length; $i++) {
my $kmer = substr($seq, $i, $seq_length);
$data{$kmer}++;
}
# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
print "$kmer $data{$kmer}\n";
$count++;
last if $count >= 5;
}
ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1
However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.
I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.
My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.
Are there any suggestions about how I could do this? I've had a look at read using an offset, but can't get my head around how I could incorporate it here.
Is substr the most memory efficient tool for this task?
From your own code it's looking like your data file has just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if it's possible that the line has one newline character at the end, the selection of the five most frequent subsequences at the end will throw this out as it happens only once
This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory
The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one as it is much clearer
After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file
I've also use constants for the K-mer size and the chunk size. They are constant after all!
The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort
use strict;
use warnings 'all';
use constant SEQ_LENGTH => 5; # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB
my $in_file = shift // 'in.txt';
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};
my %data;
my $chunk;
my $length = 0;
while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {
$length += $size;
for my $offset ( 0 .. $length - SEQ_LENGTH ) {
my $kmer = substr $chunk, $offset, SEQ_LENGTH;
++$data{$kmer};
}
$chunk = substr $chunk, -(SEQ_LENGTH-1);
$length = length $chunk;
}
my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];
output
ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1
Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.
The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.
Even if you don't read the entire file into memory before processing it, you could still run out of memory.
A 10 GiB file contains almost 11E9 sequences.
If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5^5 = 3,125 unique sequences, and this would easily fit in memory.
If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5^20 ≈ 95E12 unique sequences, so all 11E9 sequences of a 10 GiB file could be unique. That does not fit in memory.
In that case, I suggest doing the following:
Create a file that contains all the sequences of the original file.
The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few characters of the previously read block. This last detail allows sequences that span blocks to be counted.
perl -e'
use strict;
use warnings qw( all );
use constant SEQ_LENGTH => 20;
use constant CHUNK_SIZE => 1024 * 1024;
my $buf = "";
while (1) {
my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
die($!) if !defined($size);
last if !$size;
for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
print(substr($buf, $offset, SEQ_LENGTH), "\n");
}
substr($buf, 0, -(SEQ_LENGTH-1), "");
}
' <in.txt >sequences.txt
Sort the sequences.
sort sequences.txt >sorted_sequences.txt
Count the number of instances of each sequence, and store the count along with the sequences in another file.
perl -e'
use strict;
use warnings qw( all );
my $last = "";
my $count;
while (<>) {
chomp;
if ($_ eq $last) {
++$count;
} else {
print("$count $last\n") if $count;
$last = $_;
$count = 1;
}
}
' sorted_sequences.txt >counted_sequences.txt
Sort the sequences by count.
sort -rns counted_sequences.txt >sorted_counted_sequences.txt
Extract the results.
perl -e'
use strict;
use warnings qw( all );
my $last_count;
while (<>) {
my ($count, $seq) = split;
last if $. > 5 && $count != $last_count;
print("$seq $count\n");
$last_count = $count;
}
' sorted_counted_sequences.txt
This also prints ties for 5th place.
This can be optimized by tweaking the parameters passed to sort[2], but it should offer decent performance.
sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.
Given the fixed-length nature of the sequence, you could also compress the sequences into ceil(log256(5^20)) = 6 bytes then base64-encode them into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.
Portions of this answer were adapted from content by user:622310 licensed under cc by-sa 3.0.
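For illustration of the compression idea above, a hedged sketch (mine, assuming the five-letter GCNTA alphabet used for the test data later in this thread and a 64-bit perl) that packs one 20-character sequence into 6 bytes and base64-encodes it into 8:

use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my %digit = ( A => 0, C => 1, G => 2, N => 3, T => 4 );

sub pack_kmer {
    my ($kmer) = @_;
    my $n = 0;
    $n = $n * 5 + $digit{$_} for split //, $kmer;   # base-5 value; 5**20 < 2**48
    my $bytes = '';
    for ( 1 .. 6 ) {                                # 48 bits -> 6 bytes
        $bytes = chr( $n & 0xFF ) . $bytes;
        $n >>= 8;
    }
    return encode_base64( $bytes, '' );             # 8 base64 characters
}

print pack_kmer('GCNTAGCNTAGCNTAGCNTA'), "\n";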
Generally speaking, Perl is really slow at character-by-character processing solutions like those posted above; it's much faster at something like regular expressions, since essentially your overhead is mainly how many operators you're executing.
So if you can turn this into a regex-based solution that's much better.
Here's an attempt to do that:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
3 ebfeb
2 febfe
2 bfebf
1 gbacb
1 fabcg
I.e. we have our string in $str, and then we pass over it 5 times, generating sequences of 5 characters; after the first pass we start chopping one character off the front of the string. In a lot of languages this would be really slow since you'd have to re-allocate the entire string, but perl cheats for this special case and just advances the string's internal offset by one.
I haven't benchmarked this, but I bet something like this is a much more viable approach than the algorithms above. You could also do the uniq counting in perl, of course, by incrementing a hash (the /e regex option is probably the fastest way), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.
A slightly altered implementation that does this all in perl:
$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca
Pretty formatting for the code behind that:
my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;
for my $pos (0..4) {
substr($str, 0, 1) = "" if $pos;
$occur{$_}++ for $str =~ m/(.{5})/gs;
}
for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
say "$occur{$k} $k";
}
The most straightforward approach is to use the substr() function:
% time perl -e '$/ = \1048576;
while ($s = <>) { for $i (0..length $s) {
$hash{ substr($s, $i, 5) }++ } }
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n"; $it++; last if $it == 5;}' nucleotide.data
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
269.79 real 268.92 user 0.66 sys
The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @ÆvarArnfjörðBjarmason, @Vorsprung, @ThisSuitIsBlackNot, @Borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.
As a simplistic example, next unless length $kmer == 5; will get checked during each 1048576 byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way[1]. Both @Borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.
Using Stream::Reader
Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.
use Stream::Reader;
use strict;
use warnings;
open( my $handler, "<", shift );
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } );
my %hash;
my $string;
while ($stream->readto("\n", { Out => \$string }) ) {
foreach my $i (0..length $string) {
$hash{ substr($string, $i, 5) }++
}
}
my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) {
print "$k $hash{$k}\n";
$it++; last if $it == 5;
}
On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.
NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210
The Stream::Reader based script was significantly faster:
time perl sequence_search_stream-reader.pl nucleotide.data
252.12s
time perl sequence_search_borodin.pl nucleotide.data
350.57s
The file nucleotide.data was 1 GB in size, consisting of a single string of approximately 1 billion characters:
% wc nucleotide.data
0 0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA
I used this command to create the file:
perl -MString::Random=random_regex -e '
open (my $fh, ">>", "nucleotide.data");
for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) ;}'
Lists and Strings
Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.
I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:
use List::Gen 'slide';
$/ = \1048575; # Read a million character/bytes at a time.
my %hash;
while (my $seq = <>) {
chomp $seq;
foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
next unless length $kmer == 5;
$hash{$kmer}++;
}
}
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
print "$k $hash{$k}\n";
$it++; last if $it == 5;
}
I'm not sure this is "typical perl" (TIMTOWDI of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100Gb).
Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000Gb string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that were liable to create very large hashes.
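As a hedged illustration of that last idea (mine, not from any of the answers above): the counting hash can be tied to an on-disk DBM file so that it grows on disk rather than in memory. SDBM_File and Fcntl are core modules; at real scale something like DB_File or an external key-value store would be a better fit.

use strict;
use warnings;
use Fcntl;
use SDBM_File;

tie my %data, 'SDBM_File', 'kmer_counts', O_RDWR | O_CREAT, 0666
    or die "Cannot tie counts file: $!";
$data{'ebfeb'}++;    # used exactly like the in-memory %data, just slower
untie %data;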
Postscript and Results
The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent and each script selected the same top five strings with the two smaller files:
1 million character nucleotide string
sequence_search_stream-reader.pl : 0.26s
sequence_search_borodin.pl : 0.39s
sequence_search_listgen.pl : 2.04s
83 million character nucleotide string
With the data in file xaa:
wc xaa
0 1 83886080 xaa
% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
21.33 real 20.95 user 0.35 sys
% time perl sequence_search_borodin.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
28.13 real 28.08 user 0.03 sys
% time perl sequence_search_listgen.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
157.54 real 156.93 user 0.45 sys
1 billion character nucleotide string
In a larger file the differences were of similar magnitude but, because as written it does not correctly handle sequences spanning chunk boundaries, the List::Gen script had the same discrepancy as the shell command line at the beginning of this post. The larger file meant a number of chunk boundaries and a discrepancy in the count.
sequence_search_stream-reader.pl : 252.12s
sequence_search_borodin.pl : 350.57s
sequence_search_listgen.pl : 1928.34s
The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.
I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically") but I wonder what other CPAN modules or well worn subroutine style solutions might exist.
1. Skipping four characters - and thus four 5 character string combinations - at the end of each megabyte read of a terabyte file would mean the results would not include 3/10000 of 1% from the final count.
echo "scale=10; 100 * (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697
I was given data in a .txt file that I need to format into something I can upload into a database. The text is anchored with <tag>...</tag> markers. Depending on the tag, the data needs to be dumped into specific txt files, tab-delimited. I've done very little Perl in my life, but I know Perl can handle this type of application easily; I'm just lost on where to start. Outside of Java, SQL, and R I'm useless. Here is an example of a single entry (I have close to 1,000 of these to deal with):
<PaperTitle>True incidence of all complications following immediate and delayed breast reconstruction.</PaperTitle>
<Abstract>BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p < 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.</Abstract>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<EditorList>
<Editor>
<LastName>Lewis</LastName>
<ForeName>Philip M</ForeName>
<Initials>PM</Initials>
</Editor>
<Editor>
<LastName>Kiffer</LastName>
<ForeName>Michael</ForeName>
<Initials>M</Initials>
</Editor>
</EditorList>
<Page>19-28</Page>
<Year>2008</Year>
<AuthorList>
<Author ValidYN="Y">
<LastName>Sullivan</LastName>
<ForeName>Stephen R</ForeName>
<Initials>SR</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Fletcher</LastName>
<ForeName>Derek R D</ForeName>
<Initials>DR</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Isom</LastName>
<ForeName>Casey D</ForeName>
<Initials>CD</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Isik</LastName>
<ForeName>F Frank</ForeName>
<Initials>FF</Initials>
</Author>
</AuthorList>
//
PaperTitle, Abstract, and Page, need to go into the Papers.txt file
PaperTitle, BookTitle, Edition, Publisher, and Year need to go into the Book.txt file
PaperTitle, all editor data LastName, ForeName, Initials need to go into the Editors.txt
PaperTitle, all author info LastName, ForeName, Initials need to go into the Authors.txt
// marks the end of an entry. All files need to be tab delimited.
While I wouldn't turn down finished code, I am hoping for at least some ideas to put me in the right direction, or at least the code to parse out one of the files (like Book.txt); I could most likely figure it out from there. Many thanks.
This example should help you. It uses XML::Twig as I suggested to extract the fields for the Papers.txt output file. The record separator is set to "//\n" so that an entire block of data is read in one go, and before the block is parsed it is wrapped in <Paper>...</Paper> tags to make it valid XML.
use strict;
use warnings;
use 5.010;
use autodie;
use XML::Twig;
my $twig = XML::Twig->new;
open my $fh, '<', 'papers.txt';
local $/ = "//\n";
while (<$fh>) {
$twig->parse("<Paper>\n$_\n</Paper>\n");
my $root = $twig->root;
say $root->field($_) for qw/ PaperTitle Abstract Page/;
say '---';
}
output
True incidence of all complications following immediate and delayed breast reconstruction.
BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p < 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.
19-28
---
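In the same spirit, a hedged sketch (my own, following the same record-splitting assumptions as the code above) of writing the tab-delimited Book.txt lines:

use strict;
use warnings;
use autodie;
use XML::Twig;

my $twig = XML::Twig->new;
open my $in,   '<', 'papers.txt';
open my $book, '>', 'Book.txt';
local $/ = "//\n";
while (<$in>) {
    $twig->parse("<Paper>\n$_\n</Paper>\n");
    my $root = $twig->root;
    print {$book} join( "\t",
        map { $root->field($_) } qw/ PaperTitle BookTitle Edition Publisher Year / ), "\n";
}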
Please check this one:
use strict;
use warnings;
use Cwd;
#Get Directory
my $dir = getcwd();
#Grep files from the directory
opendir(DIR, $dir) || die "Couldn't open/read the $dir: $!";
my @AllFiles = grep(/\.txt$/i, readdir(DIR));
closedir(DIR);
#Check files are available
if(scalar(@AllFiles) > 0)
{
#Create Text Files as per Requirement
open(PAP, ">$dir/Papers.txt") || die "Couldn't able to create the file: $!";
open(BOOK, ">$dir/Book.txt") || die "Couldn't able to create the file: $!";
open(EDT, ">$dir/Editors.txt") || die "Couldn't able to create the file: $!";
open(AUT, ">$dir/Authors.txt") || die "Couldn't able to create the file: $!";
}
else { die "File Not found...$dir\n"; } #Die if not found files
foreach my $input (@AllFiles)
{
print "Processing file $input\n";
open(IN, "$dir/$input") || die "Couldn't able to open the file: $!";
local $/; $_=<IN>; my $tmp=$_;
close(IN);
#Loop from <PaperTitle> to // end slash
while($tmp=~m/(<PaperTitle>((?:(?!\/\/).)*)\/\/)/gs)
{
my $LoopCnt = $1;
my ($pptle) = $LoopCnt=~m/<PaperTitle>([^<>]*)<\/PaperTitle>/g;
my ($abstr) = $LoopCnt=~m/<Abstract>([^<>]*)<\/Abstract>/gs;
my ($pgrng) = $LoopCnt=~m/<Page>([^<>]*)<\/Page>/g;
my ($bktle) = $LoopCnt=~m/<BookTitle>([^<>]*)<\/BookTitle>/g;
my ($edtns) = $LoopCnt=~m/<Edition>([^<>]*)<\/Edition>/g;
my ($publr) = $LoopCnt=~m/<Publisher>([^<>]*)<\/Publisher>/g;
my ($years) = $LoopCnt=~m/<Year>([^<>]*)<\/Year>/g;
my ($EditorNames, $AuthorNames) = "";
$LoopCnt=~s#<EditorList>((?:(?!<\/EditorList>).)*)</EditorList>#
my $edtList = $1; my @Edlines = split/\n/, $edtList;
my $i ='1'; \#Editor Count to check
foreach my $EdsngLine(@Edlines)
{
if($EdsngLine=~m/<LastName>([^<>]*)<\/LastName>/)
{ $EditorNames .= $i."".$1."\t"; $i++; }
elsif($EdsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/)
{ $EditorNames .= $1."\t"; }
elsif($EdsngLine=~m/<Initials>([^<>]*)<\/Initials>/)
{ $EditorNames .= $1."\t"; }
}
#esg;
$LoopCnt=~s#<AuthorList>((?:(?!<\/AuthorList>).)*)</AuthorList>#
my $autList = $1; my @Autlines = split/\n/, $autList;
my $j ='1'; \#Author Count to check
foreach my $AutsngLine(@Autlines)
{
if($AutsngLine=~m/<LastName>([^<>]*)<\/LastName>/)
{ $AuthorNames .= $j."".$1."\t"; $j++; }
elsif($AutsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/)
{ $AuthorNames .= $1."\t"; }
elsif($AutsngLine=~m/<Initials>([^<>]*)<\/Initials>/)
{ $AuthorNames .= $1."\t"; }
}
#esg;
#Print the output in the corresponding text files
print PAP "$pptle\t$abstr\t$pgrng\t//\n";
print BOOK "$pptle\t$bktle\t$edtns\t$publr\t$years\t//\n";
print EDT "$pptle\t$EditorNames//\n";
print AUT "$pptle\t$AuthorNames//\n";
}
}
print "Process Completed...\n";
#Don't forget to close the files
close(PAP);
close(BOOK);
close(EDT);
close(AUT);
#End
I'm using an open-source perl script to create a text corpus based on the English language wikipedia dump. The plain text has been extracted, but various punctuation marks and the like still need to be removed. However, the output of this script essentially creates a 7.2GiB text file containing a single line. Due to my needs, I want to alter the script such that it inserts a new line character every 20 words.
So far, I've tried this:
$wordCount=0;
while (<STDIN>) {
$wordCount++;
# text processing regex commands here
# Remove ellipses
s/\.\.\./ /g;
# Remove dashes surrounded by spaces (e.g. phrase - phrase)
s/\s-+\s/ /g;
# Remove dashes between words with no spaces (e.g. word--word)
s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
# Remove dash at a word end (e.g. three- to five-year)
s/(\w)-\s/$1 /g;
# Remove some punctuation
s/([\"\�,;:%�?�!()\[\]{}<>_\.])/ /g;
# Remove trailing space
s/ $//;
# Remove double single-quotes
s/'' / /g;
s/ ''/ /g;
# Replace accented e with normal e for consistency with the CMU pronunciation dictionary
s/é/e/g;
# Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
s/\s'([\w\s]+[\w])'\s/ $1 /g;
# Remove double spaces
s/  / /g;
chomp($_);
if ($wordCount == 20){
print uc($_) . "\n";
$wordCount=0;
}
print uc($_) . " ";
}
print "\n";
However, this doesn't seem to work, as the raw output has only newlines scattered around arbitrarily. I'd like to have the text formatted so it will fit on a typical 1200px wide monitor without word wrapping.
A sample input text from the file is
The Concise Oxford Dictionary of Politics. Proponents of anarchism
(known as "anarchists") advocate stateless societies as the only moral
form of social organization. There are many types and traditions of
anarchism, not all of which are mutually exclusive. Anarchism as a
social movement has regularly endured fluctuations in popularity. The
term anarchism derives from the Greek ἄναρχος, anarchos, meaning
"without rulers", its use as a synonym is still common outside the
United States. The earliest anarchist themes can be found in the 6th
century BC, among the works of Taoist philosopher Laozi, and in later
centuries by Zhuangzi and Bao Jingyan. The term "anarchist" first
entered the English language in 1642, during the English Civil War, as
a term of abuse, used by Royalists against their Roundhead opponents.
By the time of the French Revolution some, such as the Enragés, began
to use the term positively, in opposition to Jacobin centralisation
of power, seeing "revolutionary government" as oxymoronic. By the
turn of the 19th century, the English word "anarchism" had lost its
initial negative connotation. Modern anarchism sprang from the secular
or religious thought of the Enlightenment, particularly Jean-Jacques
Rousseau's arguments for the moral centrality of freedom. Anarchism",
Encarta Online Encyclopedia 2006 (UK version). From this climate
William Godwin developed what many consider the first expression of
modern anarchist thought. Godwin was, according to Peter Kropotkin,
"the first to formulate the political and economical conceptions of
anarchism, even though he did not give that name to the idea s
developed in his work", while Godwin attached his anarchist ideas to
an early Edmund Burke. The anarcho-communist Joseph Déjacque was the
first person to describe himself as "libertarian". Unlike Proudhon, he
argued that, "it is not the product of his or her labor that the
worker has a right to, but to the satisfaction of his or her needs,
whatever may be t heir nature. Jesus is sometimes considered the first
anarchist in the Christian anarchist tradition. Georges Lechartier
wrote that "The true founder of anarchy was Jesus Christ and . In
Europe, harsh reaction followed the revolutions of 1848, during which
ten countries had experienced brief or long-term social upheaval as
groups carried out nationalis t uprisings. After most of these
attempts at systematic change ended in failure, conservative elements
took advantage of the divided groups of socialists, anarchists,
liberals, and na tionalists, to prevent further revolt. Blanquists,
Philadelphes, English trade unionists, socialists and social
democrats. Due to its links to active workers' movements, the
International became a significant organization. Karl Marx became a
leading figure in the International and a member of its General
Council. Proudhon's followers, the mutualists, opposed Marx's state
socialism, advocating political abstentionism and small property
holdings. In 1868, following their unsuccessful participation in the
League of Peace and Freedom (LPF), Russian revolutionary Mikhail
Bakunin and his collectivist anarchist associa tes joined the First
International (which had decided not to get involved with the LPF). At
first, the collectivists worked with the Marxists to push the First
International in a more revolutionary socialist direction.
Subsequently, the International became polarised into two camps, with
Marx and Bakunin as their respective figureheads. In 1872, the
conflict climaxed with a final split between the two groups at the
Hague Congress, where Bakunin and James Guillaume were expelled from
the International and its headquarters were transferred to New York.
In response, the federalist sections formed their own International at
the St. Imier Congress, adopting a revolutionary anarchist program.
Black Rose Books 2005) ISBN 1-55164-251-4.
There's 7-something gigs worth of text in the file. So using a list or other data structure might be a bit of overkill for these requirements.
What is needed in order to fit my requirements?
Consider using something like Text::Wrap or Text::Autoformat .
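For example, a minimal hedged sketch with the core Text::Wrap module, wrapping on a column width rather than a word count (for the multi-gigabyte single-line input you would still want to feed it in manageable chunks):

use strict;
use warnings;
use Text::Wrap qw(wrap $columns);

$columns = 120;    # roughly what fits on a 1200px-wide display; adjust to taste
while ( my $line = <STDIN> ) {
    print wrap( '', '', uc $line );
}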
open my $in, '<', $inFileName;
open my $out, '>', $outFileName;
my $wordCount = 0;
while(defined( my $line = <$in> )){
$line=~s/\n//g; #remove newline character
#split the words into an array (could use '\W+' instead of ' ')
my @words = split ' ', $line;
foreach my $word (@words){
$wordCount++;
print $out uc($word);
if ($wordCount == 20){
$wordCount = 0;
print $out "\n";
}
else {
print $out " ";
}
} # end of foreach word in line
} # end of file while loop
close $in;
close $out;
First, set perl's input record separator to something frequent and useful, like a space:
$/ = ' ';
then loop over the input word by word:
while (<>) {
trim the word:
s/^\s+|\s+$//g;
skip it if it was all space:
$_ or next;
do any other transforms you need
and then add it to a stack, splitting any internal tabs or other space-like characters:
push @words, split /\s+/;
next, check to see if you have 20 words, and if so, print them:
print join(' ' => splice @words, 0, 20), "\n" while @words >= 20;
}
then print anything remaining:
print "@words\n" if @words;
Without knowing more details about this problem, I'd suggest a brute force solution:
slurp the entire entry,
split to an array based on " ",
foreach the array and print "\n" after every 20 elements (roughly as in the sketch below).
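A minimal sketch of that brute-force idea (mine; fine for small inputs, while the streaming answers elsewhere in this thread suit the 7 GB file better):

use strict;
use warnings;

local $/;                              # slurp the whole input
my @words = split ' ', uc <STDIN>;
for my $i ( 0 .. $#words ) {
    print $words[$i];
    print( ( $i + 1 ) % 20 ? ' ' : "\n" );
}
print "\n" if @words % 20;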
True to Perl, there are various ways to solve this, but one (perverse?!) way to do it is to read the file byte by byte instead of line by line, or slurping the whole thing in. It's rather brute force-ish but it works. Essentially you are trading memory use for disk usage.
#!/usr/bin/perl -w
use strict;
open(IN, "in.txt") or die;
my $rc = 1;
my $wc = 0;
my $new;
while ($rc != 0)
{
# Read a byte - not safe for Unicode or other double-byte environments!
$rc = read IN, $new, 1, 0;
# We're only interested if the byte isn't punctuation (POSIX character class).
if ($new !~ m/[[:punct:]]/)
{
# word boundary?
if ($new =~ m/ /)
{
$wc++;
if ($wc % 20 == 0)
{
print "\n"; # 20th word, time for a new line.
}
}
print $new;
}
# move on to the next byte
seek IN, 0, 1;
}
close(IN);