exit while loop after find first match - perl

I need to exit the loop after find first match and go to another search in the loop
use strict;
use warnings;
my %iptv;
sub trim($) {
my $string = shift;
$string =~ s/\r\n//g;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
my #files=</tests/*>;
open IN, "/20131105.csv";
LINE: while (<IN>) {
chomp;
my #result = split(/;/,$_);
my $result1 = trim($_);
$result[1] = trim($result[1]);
$iptv{$result[1]} = $result1;
}
close IN;
foreach my $file (#files) {
open FILE, "$file";
while (<FILE>) {
chomp;
my ($mac, $date) = split(/;/,$_);
my #date1 = split(/\s/, $date);
print "$iptv{$mac};$date1[0]\n" if defined $iptv{$mac};
last LINE if (defined $iptv{$mac});
}
close FILE;
}
I tried to use "last" function but it finds first match and ends program. where I have to put last?

Lets take a look at the documentation:
$ perldoc -f last
last LABEL
last The "last" command is like the "break" statement in C (as used
in loops); it immediately exits the loop in question. If the
LABEL is omitted, the command refers to the innermost enclosing
loop. The "continue" block, if any, is not executed:
LINE: while (<STDIN>) {
last LINE if /^$/; # exit when done with header
#...
}
"last" cannot be used to exit a block that returns a value such
as "eval {}", "sub {}" or "do {}", and should not be used to
exit a grep() or map() operation.
Note that a block by itself is semantically identical to a loop
that executes once. Thus "last" can be used to effect an early
exit out of such a block.
See also "continue" for an illustration of how "last", "next",
and "redo" work.
We can clearly read here about how to use last. If a label is omitted, it breaks out of the innermost loop. So only in the case where we do not want this do we use a label. You want this, so you do not want a label.
Some notes on your code:
Check the return value of open, and use three arguments with a lexical file handle.
open my $fh, "<", $file or die "Cannot open $file: $!";
This also has the benefit that when the lexical variable $fh goes out of scope, the file handle is closed.
When you split on \s you split on a single whitespace. Most often, this is not what you want. If you for example have a date such as
$str = "Jan 1 2013" # (note the two consecutive spaces)
...this will split into the list "Jan", "", "1", "2013" (note the empty field). This is only what you want if empty fields are relevant, such as with csv-like data. The default behaviour of split uses ' ' (a space character), which acts like /\s+/, except that it also strips leading whitespace.
Note that the two last statements inside this loop can be merged. Also, the use of the temp array #date1 is not needed. So that your code looks like:
open my $fh, "<", $file or die "Cannot open $file: $!";
while (<$fh>) {
chomp;
my ($mac, $date) = split /;/, $_;
($date) = split ' ', $date;
if (defined $iptv{$mac}) {
print "$iptv{$mac};$date\n" ;
last;
}
}

foreach my $file (#files) {
open FILE, "$file";
LINE: while (<FILE>) {
chomp;
my ($mac, $date) = split(/;/,$_);
my #date1 = split(/\s/, $date);
print "$iptv{$mac};$date1[0]\n" if defined $iptv{$mac};
last LINE if (defined $iptv{$mac});
}
close FILE;
}
Should make sure that you only exit the inner loop.
I suppose it would work just as well if you got rid of the LINE Label right behind the last alltogether but i would suggest allways using a label with last to be certain that it does not do something unexpected in case you add an additional inner loop and forget about the last within that you expect to leave a loop farther on the outside.

Related

Copying content from one file to another file using perl

Following code is for copying file content from readfile to writefile. Instead of copying upto last, i want to copy upto some keyword.
use strict;
use warnings;
use File::Slurp;
my #lines = read_file('readfile.txt');
while ( my $line = shift #lines) {
next unless ($line =~ m/END OF HEADER/);
last; # here suggest some other logic
}
append_file('writefile.txt', #lines);
next will continue to the next iteration of the loop, effectively skipping the rest of the statements in the loop for that iteration (in this case, the last).
last will immediately exit the loop, which sounds like what you want. So you should be able to simply put the conditional statement on the last.
Also, I'm not sure why you want to read the entire file into memory to iterate over its lines? Why not just use a regular while(<>)? And I would recommend avoiding File::Slurp, it has some long-standing issues.
You don't show any example input with expected output, and your description is unclear - you said "i want to copy upto some keyword" but in your code you use shift, which removes items from the beginning of the array.
Do you want to remove the lines before or after and including or not including "END OF HEADER"?
This code will copy over only the header:
use warnings;
use strict;
my $infile = 'readfile.txt';
my $outfile = 'writefile.txt';
open my $ifh, '<', $infile or die "$infile: $!";
open my $ofh, '>', $outfile or die "$outfile: $!";
while (<$ifh>) {
last if /END OF HEADER/;
print $ofh $_;
}
close $ifh;
close $ofh;
Whereas if you want to copy everything after the header, you could replace the while above with:
while (<$ifh>) {
last if /END OF HEADER/;
}
while (<$ifh>) {
print $ofh $_;
}
Which will loop and do nothing until it sees END OF HEADER, then breaking out of the first loop and moving to the second, which prints out the lines after the header.
data.txt:
fsffs
sfsfsf
sfSDFF
END OF HEADER
{ dsgs xdgfxdg zFZ }
dgdbg
vfraeer
Code:
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
my $infile = 'data.txt';
my $header_file = 'header.txt';
my $after_header_file = 'after_header.txt';
open my $DATA, '<', $infile;
open my $HEADER, '>', $header_file;
open my $AFTER_HEADER, '>', $after_header_file;
{
local $/ = "END OF HEADER";
my $header = <$DATA>;
say {$HEADER} $header;
my $rest = <$DATA>;
say {$AFTER_HEADER} $rest;
}
close $DATA;
close $HEADER;
close $AFTER_HEADER;
say "Created files: $header_file, $after_header_file";
Output:
$ perl 1.pl
Created files: header.txt, after_header.txt
$ cat header.txt
fsffs
sfsfsf
sfSDFF
END OF HEADER
$ cat after_header.txt
{ dsgs xdgfxdg zFZ }
dgdbg
vfraeer
$/ specifies the input record separator, which by default is a newline. Therefore, when you read from a file:
while (my $x = <$INFILE>) {
}
each value of $x is a sequence of characters up to and including the input recored separator, i.e. a newline, which is what we normally think of as a line of text in a file. Often, we chomp off the newline/input_record_separator at the end of the text:
while (my $x = <$INFILE>) {
chomp $x;
say "$x is a dog";
}
But, you can set the input record separator to anything you want, like your "END OF HEADER" text. That means a line will be all the text up to and including the input record separator, which in this case is "END OF HEADER". For example, a line will be: "abc\ndef\nghi\nEND OF HEADER". Furthermore, chomp() will now remove "END OF HEADER" from the end of its argument, so you could chomp your line if you don't want the "END OF HEADER" marker in the output file.
If perl cannot find the input record separator, then perl keeps reading the file until perl hits the end of the file, then perl returns all the text that was read.
You can use those operations to your advantage when you want to seek to some specific text in a file.
Declaring a variable as local makes the variable magical: when the closing brace of the surrounding block is encountered, perl sets the variable back to the value it had just before the opening brace of the surrounding block:
#Here, by default $/ = "\n", but some code out here could have
#also set $/ to something else
{
local $/ = "END OF HEADER";
} # $/ gets set back to whatever value it had before this block
When you change one of perl's predefined global variables, it's considered good practice to only change the variable for as long as you need to use the variable, then change the variable back to what it was.
If you want to target just the text between the braces, you can do:
data.txt:
fsffs
sfsfsf
sfSDFF
END OF HEADER { dsgs xdgfxdg zFZ }
dgdbg
vfraeer
Code snippet:
...
...
{
local $/ = 'END OF HEADER {';
my $pre_brace = <$DATA>;
$/ = '}';
my $target_text = <$DATA>;
chomp $target_text; #Removes closing brace
say "->$target_text<-";
}
--output:--
-> dsgs xdgfxdg zFZ <-

find a match and replace next line in perl

I am working on the perl script and need some help with it. The requirement is, I have to find a lable and once the label is found, I have to replace the word in a line immediately following the label. for Example, if the label is ABC:
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
I want to write a script to match the label (ABC) and once the label is found, replace a word in the next line immediately following the label.
Here is my attempt:
open(my $fh, "<", "file1.txt") or die "cannot open file:$!";
while (my $line = <$fh>))
{
next if ($line =~ /ABC/) {
$line =~ s/original_string/replaced_string/;
}
else {
$msg = "pattern not found \n ";
print "$msg";
}
}
Is this correct..? Any help will be greatly appreciated.
The following one-liner will do what you need:
perl -pe '++$x and next if /ABC:/; $x-- and s/old/new/ if $x' inFile > outFile
The code sets a flag and gets the next line if the label is found. If the flag is set, it's unset and the substitution is executed.
Hope this helps!
You're doing this in your loop:
next if ($line =~ /ABC/);
So, you're reading the file, if a line contains ABC anywhere in that line, you skip the line. However, for every other line, you do the replacement. In the end, you're replacing the string on all other lines and printing that out, and your not printing out your labels.
Here's what you said:
I have to read the file until I find a line with the label:
Once the label is found
I have to read the next line and replace the word in a line immediately following the label.
So:
You want to read through a file line-by-line.
If a line matches the label
read the next line
replace the text on the line
Print out the line
Following these directions:
use strict;
use warnings; # Hope you're using strict and warnings
use autodie; # Program automatically dies on failed opens. No need to check
use feature qw(say); # Allows you to use say instead of print
open my $fh, "<", "file1.txt"; # Removed parentheses. It's the latest style
while (my $line = <$fh>) {
chomp $line; # Always do a chomp after a read.
if ( $line eq "ABC:" ) { # Use 'eq' to ensure an exact match for your label
say "$line"; # Print out the current line
$line = <$fh> # Read the next line
$line =~ s/old/new/; # Replace that word
}
say "$line"; # Print the line
}
close $fh; # Might as well do it right
Note that when I use say, I don't have to put the \n on the end of the line. Also, by doing my chomp after my read, I can easily match the label without worrying about the \n on the end.
This is done exactly as you said it should be done, but there are a couple of issues. The first is that when we do $line = <$fh>, there's no guarantee we are really reading a line. What if the file ends right there?
Also, it's bad practice to read a file in multiple places. It makes it harder to maintain the program. To get around this issue, we'll use a flag variable. This allows us to know if the line before was a tag or not:
use strict;
use warnings; # Hope you're using strict and warnings
use autodie; # Program automatically dies on failed opens. No need to check
use feature qw(say); # Allows you to use say instead of print
open my $fh, "<", "file1.txt"; # Removed parentheses. It's the latest style
my $tag_found = 0; # Flag isn't set
while (my $line = <$fh>) {
chomp $line; # Always do a chomp after a read.
if ( $line eq "ABC:" ) { # Use 'eq' to ensure an exact match for your label
$tag_found = 1 # We found the tag!
}
if ( $tag_found ) {
$line =~ s/old/new/; # Replace that word
$tag_found = 0; # Reset our flag variable
}
say "$line"; # Print the line
}
close $fh; # Might as well do it right
Of course, I would prefer to eliminate mysterious values. For example, the tag should be a variable or constant. Same with the string you're searching for and the string you're replacing.
You mentioned this was a word, so your regular expression replacement should probably look like this:
$line =~ s/\b$old_word\b/$new_word/;
The \b mark word boundaries. This way, if you're suppose to replace the word cat with dog, you don't get tripped up on a line that says:
The Jeopardy category is "Say what".
You don't want to change category to dogegory.
Your problem is that reading in a file does not work like that. You're doing it line by line, so when your regex tests true, the line you want to change isn't there yet. You can try adding a boolean variable to check if the last line was a label.
#!/usr/bin/perl;
use strict;
use warnings;
my $found;
my $replacement = "Hello";
while(my $line = <>){
if($line =~ /ABC/){
$found = 1;
next;
}
if($found){
$line =~ s/^.*?$/$replacement/;
$found = 0;
print $line, "\n";
}
}
Or you could use File::Slurp and read the whole file into one string:
use File::Slurp;
$x = read_file( "file.txt" );
$x =~ s/^(ABC:\s*$ [\n\r]{1,2}^.*?)to\sbe/$1to was/mgx;
print $x;
using /m to make the ^ and $ match embedded begin/end of lines
x is to allow the space after the $ - there is probably a better way
Yields:
ABC:
string to was replaced
some other lines
ABC:
string to was replaced
some other lines
ABC:
string to was replaced
Also, relying on perl's in-place editing:
use File::Slurp qw(read_file write_file);
use strict;
use warnings;
my $file = 'fakefile1.txt';
# Initialize Fake data
write_file($file, <DATA>);
# Enclosed is the actual code that you're looking for.
# Everything else is just for testing:
{
local #ARGV = $file;
local $^I = '.bac';
while (<>) {
print;
if (/ABC/ && !eof) {
$_ = <>;
s/.*/replaced string/;
print;
}
}
unlink "$file$^I";
}
# Compare new file.
print read_file($file);
1;
__DATA__
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
ABC:
outputs
ABC:
replaced string
some other lines
ABC:
replaced string
some other lines
ABC:
replaced string
ABC:

how to extract substrings by knowing the coordinates

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...
I want to extract several substrings from a file whick contains string by using another file with the begin and the end of each substring that I want to extract.
The first file is like:
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file is like:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
my script is this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
if ($_ !~ m/^>/) {
local $/ = undef;
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my #cline = <INFILE>;
foreach my $cline (#cline) {
print "$cline\n";
my#data = split('\t', $cline);
my $start = $data[0];
my $end = $data[1];
my $offset = $end - $start;
$count++;
my $sub = substr ($var, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE;
}
}
when I run it, it looks like it does only one iteration and it prints me the start of the first file.
It seems like the foreach loop doesn't work.
also substr seems that doesn't work.
when I put an exit to print the cline to check the loop, it prints all the lines of the file with the coordinates.
I am sorry if I become annoying, but I must finish it and I am a little bit desperate...
Thank you again.
This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:
my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
Whereas this program reads all the way to EOF:
local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
This means your #cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:
use Data::Dumper;
print Dumper(\#cline);
Which in your case will output something like:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.
Let's walk through the relevant sections of your code:
while(<INFILE1>) {
if ($_ !~ m/^>/) {
# Enable localized slurp mode. Stays in effect until we leave the 'if'
local $/ = undef;
# Read the rest of INFILE1 into $var (from current line to EOF)
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
# In list context, return each block until the $/ character as a
# separate list element. Since $/ is still undef, this will read
# everything until EOF into our first list element, resulting in
# a one-element array
my #cline = <INFILE>;
# Since #cline only has one element, the loop only has one iteration
foreach my $cline (#cline) {
As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
next if /^>/;
$chromosome .= $_;
}
Then read in your coordinates file:
my #cline = <$coordinates_fh>;
Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:
while (<$coordinates_fh>) {
# Do something for each line here
}
As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
chomp;
$fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
my ($start, $end) = split;
# Or, should this be: my $offset = $end - ($start - 1);
# That would include the start fasta
my $offset = $end - $start;
$count++;
my $sub = substr ($fasta, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";

Print only the first word in line

I need some help with following perl code.
#!perl -w
use strict;
use warnings;
open my $file, '<', 'ubb' or die $1;
my $spool = 0;
my #matchingLines;
while (<$file>) {
if (/GROUPS/i) {
$spool = 1;
next;
}
elsif (/SERVERS/i) {
$spool = 0;
print map { "$_" } #matchingLines;
#matchingLines = ();
}
if ($spool) {
push (#matchingLines, $_);
}
}
close ($file);
Output from that is shown below.
ADM LMID=GW_S4_1_PM,GW_S4_2_BM
GRPNO=1
ADM_TMS LMID=GW_S4_1_PM,GW_S4_2_BM
GRPNO=2
TMSNAME=TMS
ADM_1 LMID=GW_S4_1_PM
GRPNO=11
ADM_2 LMID=GW_S4_2_BM
GRPNO=12
DMWSG_Gateway_1 LMID=GW_S4_1_PM
GRPNO=101
ENVFILE="../GW_S4.Gateway.envfile"
DMWSG_Gateway_2 LMID=GW_S4_2_BM
GRPNO=201
ENVFILE="../GW_S4.Gateway.envfile"
DMWSG_1 LMID=GW_S4_1_PM
GRPNO=106
DMWSG_2 LMID=GW_S4_2_BM
GRPNO=206
But I only would like to get the first word of each line (e.g. ADM, ADM_TMS, ADM_1).
Note that the file has a lot of other lines above and below what's printed here. I only want to do this for lines that is in between GROUPS and SERVERS.
I would suggest 2 changes in your code
Note: Tested these with your sample data (plus other stuff) in your question.
I: Extract first word before push
Change this
push (#matchingLines, $_);
to
push (#matchingLines, /^(\S+)/);
This would push the first word of each line into the array, instead of the entire line.
Note that /^(\S+)/ is shorthand for $_ =~ /^(\S+)/. If you're using an explicit loop variable like in 7stud's answer, you can't use this shorthand, use the explicit syntax instead, say $line =~ /^(\S+)/ or whatever your loop variable is.
Of course, you can also use split function as suggested in 7stud's answer.
II: Change how you print
Change this
print map { "$_" } #matchingLines;
into
local $" = "\n";
print "#matchingLines \n";
$" specifies the delimiter used for list elements when the array is printed with print or say inside double quotes.
Alternatively, as per TLP's suggestion,
$\ = $/;
print for #lines;
or
print join("\n", #lines), "\n"
Note that $/ is the input record separator (newline by default), $\ is the output record separator (undefined by default). $\ is appended after each print command.
For more information on $/, $\, and $":
See perldoc perlvar (just use CTRL+F to find them in that page)
Or you can simply use perldoc -v '$/' etc on your console to get those information.
Note on readability
I don't think implicit regex matching i.e. /pattern/ is bad per se.
But matching against a variable, i.e. $variable =~ /pattern/ is more readable (as in you can immediately see there's a regex matching going on) and more beginner-friendly, at the cost of conciseness.
use strict;
use warnings;
use 5.014; #say()
my $fname = 'data.txt';
open my $INFILE, '<', $fname
or die "Couldn't open $fname: $!"; #-->Not $1"
my $recording_on = 0;
my #matching_lines;
for my $line (<$INFILE>) {
if ($line =~ /groups/i) {
$recording_on = 1;
next;
}
elsif ($line =~ /servers/i) {
say for #matching_lines; #say() is the same as print(), but it adds a newline at the end
#matching_lines = ();
$recording_on = 0;
}
if ($recording_on) {
my ($first_word, $trash) = split " ", $line, 2;
push #matching_lines, $first_word;
}
}
close $INFILE;
You can use the flip-flop operator (range) to select a part of your input. The idea of this operator is that it returns false until its LHS (left hand side) returns true, and after that it returns true until its RHS returns false, after which it is reset. It is somewhat like preserving a state.
Note that the edge lines are also included in the match, so we need to remove those. After that, use doubleDown's idea and push /^(\S+)/ onto an array. The nice thing about using this with push is that the capture regex returns an empty list if it fails, and this gives us a warning-free failure when the regex does not match.
use strict;
use warnings;
my #matches;
while (<>) {
if (/GROUPS/i .. /SERVERS/i) { # flip-flop remembers the matches
next if (/GROUPS/i or /SERVERS/i);
push #matches, /^(\S+)/;
}
}
# #matches should now contain the first words of those lines

How do I list multiple sentences which contain the same word. The heading is the word that is contained in those sentences

This currently prints all the nouns with sentences they are found in right below.
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
my $search_key = "expend"; ## CHANGE "..." to <>
open(my $tag_corpus, '<', "ch13tagged.txt") or die $!;
my #sentences = <$tag_corpus>; # This breaks up each line into list
my #words;
my %seens = ();
my %seenw = ();
for (my $i = 0; $i <= #sentences; $i++) {
if (defined($sentences[$i]) and $sentences[$i] =~ /($search_key)_VB.*/i) {
#words = split /\s/, $sentences[$i]; ## \s is a whitespace
for (my $j = 0; $j <= #words; $j++) {
#FILTER if word is noun, and therefore will end with _NN:
if (defined($words[$j]) and $words[$j] =~ /_NN/) {
#PRINT word (without _NN) and sentence (without any _ENDING):
next if $seenw{$words[$j]}++; ## How to include plural etc
push #words, $words[$j];
print "**", split(/_\S+/, $words[$j]), "**", "\n";
## next if $seens{ $sentences[$i] }++;
## push #sentences, $sentences[$i];
print split(/_\S+/, $sentences[$i]), "\n"
## HOW PRINT bold or specifically word bold?
#FILTER if word has been output, add sentence under that heading
}
} ## put print sentences here to print each sentence after all the nouns inside
}
}
close $tag_corpus || die "Can't close $tag_corpus: $!";
Your original:
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
That's a good start...
my $search_key = "expend"; ## CHANGE "..." to <>
Since you're going to use this in a regex in a loop, it's better to compile the
regex right now: my $verb_regex = qr/\bexpend_VB\b/i. I put word boundaries in
there, because it seems like you need them. '
open(my $tag_corpus, '<', "ch13tagged.txt") or die $!;
my #sentences = <$tag_corpus>; # This breaks up each line into list
my #words;
my %seens = ();
my %seenw = ();
for (my $i = 0; $i <= #sentences; $i++) {
This does much of the same with less overhead:
while ( <$tag_corpus> ) {
...
Back to yours:
if (defined($sentences[$i]) and $sentences[$i] =~ /($search_key)_VB.*/i) {
If the line contains the record separator--and it will unless you chomp it, you'll always be
getting a defined line until the end of the file. There's no need to test for defined.
Additionally, you don't need the .* after the search term and capturing the $search_key
here has no effect.
#words = split /\s/, $sentences[$i]; ## \s is a whitespace
You don't want to split on a single space for whitespace. You should use /\s+/, but
even better is: #words = split ' ', $sentences[$i];
But you won't even need that.
for (my $j = 0; $j <= #words; $j++) {
#FILTER if word is noun, and therefore will end with _NN:
if (defined($words[$j]) and $words[$j] =~ /_NN/) {
#PRINT word (without _NN) and sentence (without any _ENDING):
But that's all you're if-ing on: words ending in _NN. In addition, the whole
list from a split will be defined-- no need to test.
next if $seenw{$words[$j]}++; ## How to include plural etc
Unless you want to reset %seenw after each sentence, you'll only process each _NN
word once per file.
push #words, $words[$j];
I don't see how this push can serve any possible purpose by appending nouns
back on the list of words. Sure you've got the uniqueness check before it to save
you from the infinite loop if there are any _NN words, but it just means you'll have
all the words in the sentence, followed by all the "nouns". Not only that, but you're simply
going to test that it's an noun and do nothing with it. Not to mention that you
clobber the list with the next sentence.
print "**", split(/_\S+/, $words[$j]), "**", "\n";
## next if $seens{ $sentences[$i] }++;
You don't want to do this in the word loop
## push #sentences, $sentences[$i];
Again, I'm not thinking that you would want to do this if it were uncommented
and outside the word loop. It seems like everything from 2 lines ago would be
after the word loop.
print split(/_\S+/, $sentences[$i]), "\n"
## HOW PRINT bold or specifically word bold?
#FILTER if word has been output, add sentence under that heading
}
} ## put print sentences here to print each sentence after all the nouns inside
}
}
close $tag_corpus || die "Can't close $tag_corpus: $!";
Nope. That won't handle the bad return from close. The || or is "binding" too
tightly. You are closing either $tag_corpus or the output of die. Luckily (or perhaps unluckily)
the die never gets called because if we got this far, $tag_corpus should be a
true value.
This is a kind of cleaned-up version of what you're trying to do--with the
parts that I can make sense of left in.
my #sentences;
# We're processing a single line at a time.
while ( <$tag_corpus> ) {
# Test if we want to work with the line
next unless m/$verb_regex/;
# If we do, then test that we haven't dealt with it before
# Although I suspect that this may not be needed as much if we're not
# pushing to a queue that we're reading from.
next if $seens{ $_ }++;
# split -> split ' ', $_
# pass through only those words that match _NN at the end and
# are unique so far. We test on a substitution, because the result
# still uniquely identifies a noun
foreach my $noun ( grep { s/_NN$// && !$seenw{ $_ }++ } split ) {
print "**$noun**\n";
}
# This will omit any adjacent punctuation you have after the word--if
# that's a problem.
print split( /_\S+/ ), "\n";
# Here we save the sentence.
push #sentences, $_;
}
close $tag_corpus or die "Can't close ch13tagged.txt: $!";