Find last occurrences of a string from multiple files in a directory using perl - perl

In a directory have multiple file . from all that file i have to find the last occurrence of a string and print that line into a result.txt file.
Let assume the files are like demo*.txt.
And the string i have to find is "FREQ" .
The code i have tried is as follows :
for ( glob("demo*.txt" ) ) {
my $xm = $_;
open my $fh, '<', $xm;
my $pat = "FREQ" ;
while(my $asa = <$fh>) {
my #last = grep(/FREQ/, $asa);
}
}
print " grep : #last\n";
I have tried this code but this giving nothing and I cant able to find the mistake i am making.

As #vkk05 suggests, you need to add use strict & use warnings to your code
use strict;
use warnings;
for ( glob("/tmp/demo*.txt" ) ) {
my $xm = $_;
open my $fh, '<', $xm;
my $pat = "FREQ" ;
while(my $asa = <$fh>) {
my #last = grep(/FREQ/, $asa);
}
}
print " grep : #last\n";
When I run that I get
Possible unintended interpolation of #last in string at /tmp/fred.pl line 15.
Global symbol "#last" requires explicit package name (did you forget to declare "my #last"?) at /tmp/fred.pl line 15.
Execution of /tmp/fred.pl aborted due to compilation errors.
The problem is the my #last line. You are defining #last in the scope of the while loop. That means the variable #last gets deleted once the while loop exits.
Moving the definition of #last to outside the while scope and moving the print line to run immediately after the while loop gives this code
use strict;
use warnings;
for ( glob("/tmp/demo*.txt" ) ) {
my $xm = $_;
open my $fh, '<', $xm;
my $pat = "FREQ" ;
my #last;
while(my $asa = <$fh>) {
#last = grep(/FREQ/, $asa);
}
print " grep : #last\n";
}
and assuming a test file /tmp/demo1.txt
$ cat /tmp/demo1.txt
abc
FREQ 123
FREQ 456
running the script gives
grep : FREQ 456
Finally, the use of grep in this context is confusing -- it is intended to be used when iterating through a list of values, rather than a single entry as in this case. Although the code works fine with the way you are using grep, a more perl-idiomatic solution that uses grep would be
use strict;
use warnings;
my $pat = "FREQ" ;
for my $xm ( glob("/tmp/demo*.txt" ) ) {
open my $fh, '<', $xm
or die "Cannot open '$xm': $!";
my #matches = grep { /$pat/ }
<$fh>;
print " file $xm : $matches[-1]\n";
}
Points to note
grep is used here in a proper list context, along with <$fh> to walk the file and match all lines that include the pattern stored in $pat
The results are stored in #matches, so to get the last one, the print statement uses $matches[-1] to index the final entry in #matches.

If this is a one time only task I would use a simple one-liner like this.
Create four test files. The last is empty:
echo -e "FREQ\nline 2\nFREQ again\nline4" > demo1.txt
echo -e "FREQUENT\nline 2 in demo2\nShrek\nXFREQ here" > demo2.txt
echo -e "nothing to see\nnothing to see!" > demo3.txt
touch demo4.txt
Run:
perl -E '#f=#ARGV; /FREQ/ and $L{$ARGV}=$_ while<>; print "Last FREQ line in $_: ".($L{$_}//"<none>\n") for #f' demo*.txt | tee result.txt
Output:
Last FREQ line in demo1.txt: FREQ again
Last FREQ line in demo2.txt: XFREQ here
Last FREQ line in demo3.txt: <none>
Last FREQ line in demo4.txt: <none>

Related

Split file Perl

I want to split parts of a file. Here is what the start of the file looks like (it continues in same way):
Location Strand Length PID Gene
1..822 + 273 292571599 CDS001
906..1298 + 130 292571600 trxA
I want to split in Location column and subtract 822-1 and do the same for every row and add them all together. So that for these two results the value would be: (822-1)+1298-906) = 1213
How?
My code right now, (I don't get any output at all in the terminal, it just continue to process forever):
use warnings;
use strict;
my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable
while(my $line = $line2){ # reading the file line by line
# TODO Use split and do the calculations
my #row = split(/\.\./, $line);
my #row2 = split(/\D/, $row[1]);
$coding += $row2[0]- $row[0];
}
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1) but the next number doesn't make any sense to me, it should be (1298-906). What I want in the end outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of all the subtractions of every line i.e. 1213. But I don't get anything, just a terminal that works on forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $total;
while (<>) {
my ($min, $max) = /(\d+)\.\.(\d+)/;
next unless $min and $max;
$total += $max - $min;
}
say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
# your code here
}

Parsing data from delimited blocks

I have a log file content many blocks /begin CHECK ... /end CHECK like below:
/begin CHECK
Var_AAA
"Description AAA"
DATATYPE UBYTE
Max_Value 255.
ADDRESS 0xFF0011
/end CHECK
/begin CHECK
Var_BBB
"Description BBB"
DATATYPE UBYTE
Max_Value 255.
ADDRESS 0xFF0022
/end CHECK
...
I want to extract the variable name and its address, then write to a new file like this
Name Address
Var_AAA => 0xFF0011
Var_BBB => 0xFF0022
I am just thinking about the ($start, $keyword, $end) to check for each block and extract data after keyword only
#!/usr/bin/perl
use strict;
use warnings;
my $input = 'input.log';
my $output = 'output.out';
my ( $start, $keyword, $end ) = ( '^\/begin CHECK\n\n', 'ADDRESS ', '\/end CHECK' );
my #block;
# open input file for reading
open( my $in, '<', $input ) or die "Cannot open file '$input' for reading: $!";
# open destination file for writing
open( my $out, '>', $output ) or die "Cannot open file '$output' for writing: $!";
print( "copying variable name and it's address from $input to $output \n" );
while ( $in ) { #For each line of input
if ( /$start/i .. /$end/i ) { #Block matching
push #block, $_;
}
if ( /$end/i ) {
for ( #block ) {
if ( /\s+ $keyword/ ) {
print $out join( '', #block );
last;
}
}
#block = ();
}
close $in or die "Cannot close file '$input': $!";
}
close $out or die "Cannot close file '$output': $!";
But I got nothing after execution. Can anyone suggest me with sample idea?
Most everything looks good but it's your start regex that's causing the first problem:
'^\/begin CHECK\n\n'
You are reading lines from the file but then looking for two newlines in a row. That's not going to ever match because a line ends with exactly one newline (unless you change $/, but that's a different topic). If you want to match the send of a line, you can use the $ (or \z) anchor:
'^\/begin CHECK$'
Here's the program I pared down. You can adjust it to do all the rest of the stuff that you need to do:
use v5.10;
use strict;
use warnings;
use Data::Dumper;
my ($start, $keyword, $end) = (qr{^/begin CHECK$}, qr(^ADDRESS ), qr(^/end CHECK));
while (<DATA>) #For each line of input
{
state #block;
chomp;
if (/$start/i .. /$end/i) #Block matching
{
push #block, $_ unless /^\s*$/;
}
if( /$end/i )
{
print Dumper( \#block );
#block = ();
}
}
After that, you're not reading the data. You need to put the filehandle inside <> (the line input operator):
while ( <$in> )
The file handles will close themselves at the end of the program automatically. If you want to close them yourself that's fine but don't do that until you are done. Don't close $in until the while is finished.
using the command prompt in windows. In MacOS or Unix will follow the same logic you can do:
perl -wpe "$/='/end CHECK';s/^.*?(Var_\S+).*?(ADDRESS \S+).*$/$1 => $2\n/s" "your_file.txt">"new.txt
first we set the endLine character to $/ = "/end CHECK".
we then pick only the first Var_ and the first ADDRESS. while deleting everything else in single line mode ie Dot Matches line breaks \n. s/^.*?(Var_\S+).*?(ADDRESS \S+).*$/$1 => $2\n/s.
We then write the results into a new file. ie >newfile.
Ensure to use -w -p -e where -e is for executing the code, -p is for printing and -w is for warnings:
In this code, I did not write the values to a new file ie, did not include the >newfile.txt prt so that you may be able to see the result. If you do include the part, just open the newfile.txt and everything will be printed there
Here are some of the issues with your code
You have while ($in) instead of while ( <$in> ), so your program never reads from the input file
You close your input file handle inside the while read loop, so you can only ever read one record
Your $start regex pattern is '^\/begin CHECK\n\n'. The single quotes make your program search for backslash n backslash n instead of newline newline
Your test if (/\s+ $keyword/) looks for multiple space characters of any sort, followed by a space, followed by ADDRESS—the contents of $keyword. There are no occurrences of ADDRESS preceded by whitespace anywhere in your data
You have also written far too much without testing anything. You should start by writing your read loop on its own and make sure that the data is coming in correctly before proceeding by adding two or three lines of code at a time between tests. Writing 90% of the functionality before testing is a very bad approach.
In future, to help you address problems like this, I would point you to the excellent resources linked on the Stack Overflow Perl tag information page
The only slightly obscure thing here is that the range operator /$start/i .. /$end/i returns a useful value; I have copied it into $status. The first time the operator matches, the result will be 1; the second time 2 etc. The last time is different because it is a string that uses engineering notation like 9E0, so it still evaluates to the correct count but you can check for the last match using /E/. I've used == 1 and /E/ to avoid pushing the begin and end lines onto #block
I don't think there's anything else overly complex here that you can't find described in the Perl language reference
use strict;
use warnings;
use autodie; # Handle bad IO status automatically
use List::Util 'max';
my ($input, $output) = qw/ input.log output.txt /;
open my $in_fh, '<', $input;
my ( #block, #vars );
while ( <$in_fh> ) {
my $status = m{^/begin CHECK}i .. m{^/end CHECK}i;
if ( $status =~ /E/ ) { # End line
#block = grep /\S/, #block;
chomp #block;
my $var = $block[0];
my $addr;
for ( #block ) {
if ( /^ADDRESS\s+(0x\w+)/ ) {
$addr = $1;
last;
}
}
push #vars, [ $var, $addr ];
#block = ();
}
elsif ( $status ) {
push #block, $_ unless $status == 1;
}
}
# Format and generate the output
open my $out_fh, '>', $output;
my $w = max map { length $_->[0] } #vars;
printf $out_fh "%-*s => %s\n", $w, #$_ for [qw/ Name Address / ], #vars;
close $out_fh;
output
Name => Address
Var_AAA => 0xFF0011
Var_BBB => 0xFF0022
Update
For what it's worth, I would have written something like this. It produces the same output as above
use strict;
use warnings;
use autodie; # Handle bad IO status automatically
use List::Util 'max';
my ($input, $output) = qw/ input.log output.txt /;
my $data = do {
open my $in_fh, '<', $input;
local $/;
<$in_fh>;
};
my #vars;
while ( $data =~ m{^/begin CHECK$(.+?)^/end CHECK$}gms ) {
my $block = $1;
next unless $block =~ m{(\w+).+?ADDRESS\s+(0x\w+)}ms;
push #vars, [ $1, $2 ];
}
open my $out_fh, '>', $output;
my $w = max map { length $_->[0] } #vars;
printf $out_fh "%-*s => %s\n", $w, #$_ for [qw/ Name Address / ], #vars;
close $out_fh;

How to print lines from log file which occurs after some particular time

I want to print all the lines which lets say occur after a time whose value is returned by localtime function of perl inside a perl script. I tried something like below:
my $timestamp = localtime();
open(CMD,'-|','cat xyz.log | grep -A1000 \$timestamp' || die ('Could not open');
while (defined(my $line=<CMD>)){
print $line;
}
If I replace the $timestamp in cat command with actaul time component from xyz.log then it print lines but its not printing with $timestamp variable.
Is there any alternative way I can print lines that occurs after current time in log files or how i can improve above command?
Your $timestamp is never evaluated in Perl as it appears only in single quotes. But why go out to shell in order to match a string and process a file? Perl is far better for that.
Here is a direct way first, then a basic approach. A full script is shown in the second example.
Read the file until you get to the line with the pattern, and exit the loop at that point. The next time you access that filehandle you'll be on the next line and can start printing, in another loop.
while (<$fh>) { last if /$timestamp/ }
print while <$fh>;
This prints out the part of the file starting with the line following the one which has the $timestamp anywhere in it. Adjust how exactly to match the timestamp if it is more specific.
Or -- set a flag when a line matches the timestamp, print if flag is set.
use warnings 'all';
use strict;
my $timestamp = localtime();
my $logile = 'xyz.log';
open my $fh, '<', $logfile or die "Can't open $logfile: $!";
my $mark = 0;
while (<$fh>)
{
if (not $mark) {
$mark = 1 if /$timestamp/;
}
else { print }
}
close $fh;
If you're doing the grepping in Shell anyway you might as well do it the other way round and call perl only to give you the result of localtime:
sed <xyz.log -ne"/^$(perl -E'say scalar localtime')/,\$p"
This uses sed's range addressing: first keep it from printing lines unless explicitly told so using -n, the select everything between the first occurrence of the timestamp (I added a ^ for good measure, just in case log lines could contain time stamps in plain text) and the end of file ($) and print it (p).
A pure Perl solution could look like this:
my $timestamp = localtime();
my $found;
open(my $fh, '<', 'xyz.log') or die ('Could not open xyz.log: $!');
while (<$fh>) {
if($found) {
print;
} else {
$found = 1 if /^$timestamp/;
}
}
I would suggest a Perlish approach like below:
open (my $cmd, "<", "xyz.log") or die $!;
#get all lines in an array with each index containing each line
my #log_lines = <$cmd>;
my $index = 0;
foreach my $line (#log_lines){
#write regex to capture time from line
my $rex = qr/regex_for_time_as_per_logline/is;
if ($line =~ /$rex/){
#found the line with expected time
last;
}
$index++;
}
#At this point we have got the index of array from where our expected time starts.
#So all indexes after that have desired lines, which you can write as below
foreach ($index..$#log_lines){
print $log_lines[$_];
}
If you share one of your logline, I could help with the regex.
You may also try this approach:
In this case, I tried to open /var/log/messages then convert each line timestamp to epoch and finding all the lines which has occurred after time()
use Date::Parse;
my $epoch_now = time(); # print epoch current time.
open (my $fh, "</var/log/messages") || die "error: $!\n";
while (<$fh>) {
chomp;
# one log line - looks like this
# Sep 9 08:17:01 localhost rsyslogd: rsyslogd was HUPed
my ($mon, $day, $hour, $min, $sec) = ($_ =~ /(\S+)\s*(\d+)\s*(\d+):(\d+):(\d+)/);
# date string part shouldn't be empty
if (defined($mon) && defined($day)
&& defined($hour) && defined($min)
&& defined($sec)) {
my $epoch_log = str2time("$mon $day $hour:$min:$sec");
if ($epoch_log > $epoch_now) {
print, "\n";
}
}
}

Perl find and replace multiple(huge) strings in one shot

Based on a mapping file, i need to search for a string and if found append the replace string to the end of line.
I'm traversing through the mapping file line by line and using the below perl one-liner, appending the strings.
Issues:
1.Huge find & replace Entries: But the issues is the mapping file has huge number of entries (~7000 entries) and perl one-liners takes ~1 seconds for each entries which boils down to ~1 Hour to complete the entire replacement.
2.Not Simple Find and Replace: Its not a simple Find & Replace. It is - if found string, append the replace string to EOL.
If there is no efficient way to process this, i would even consider replacing rather than appending.
Mine is on Windows 7 64-Bit environment and im using active perl. No *unix support.
File Samples
Map.csv
findStr1,RplStr1
findStr2,RplStr2
findStr3,RplStr3
.....
findStr7000,RplStr7000
input.csv
col1,col2,col3,findStr1,....col-N
col1,col2,col3,findStr2,....col-N
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
output.csv (Expected Output)
col1,col2,col3,findStr1,....col-N,**RplStr1**
col1,col2,col3,findStr1,....col-N,**RplStr2**
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
Perl Code Snippet
One-Liner
perl -pe '/findStr/ && s/$/RplStr/' file.csv
open( INFILE, $MarketMapFile ) or die "Error occured: $!";
my #data = <INFILE>;
my $cnt=1;
foreach $line (#data) {
eval {
# Remove end of line character.
$line =~ s/\n//g;
my ( $eNodeBID, $MarketName ) = split( ',', $line );
my $exeCmd = 'perl -i.bak -p -e "/'.$eNodeBID.'\(M\)/ && s/$/,'.$MarketName.'/;" '.$CSVFile;
print "\n $cnt Repelacing $eNodeBID with $MarketName and cmd is $exeCmd";
system($exeCmd);
$cnt++;
}
}
close(INFILE);
To do this in a single pass through your input CSV, it's easiest to store your mapping in a hash. 7000 entries is not particularly huge, but if you're worried about storing all of that in memory you can use Tie::File::AsHash.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Tie::File::AsHash;
tie my %replace, 'Tie::File::AsHash', 'map.csv', split => ',' or die $!;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/ })
or die Text::CSV->error_diag;
open my $in_fh, '<', 'input.csv' or die $!;
open my $out_fh, '>', 'output.csv' or die $!;
while (my $row = $csv->getline($in_fh)) {
push #$row, $replace{$row->[3]};
$csv->print($out_fh, $row);
}
untie %replace;
close $in_fh;
close $out_fh;
map.csv
foo,bar
apple,orange
pony,unicorn
input.csv
field1,field2,field3,pony,field5,field6
field1,field2,field3,banana,field5,field6
field1,field2,field3,apple,field5,field6
output.csv
field1,field2,field3,pony,field5,field6,unicorn
field1,field2,field3,banana,field5,field6,
field1,field2,field3,apple,field5,field6,orange
I don't recommend screwing up your CSV format by only appending fields to matching lines, so I add an empty field if a match isn't found.
To use a regular hash instead of Tie::File::AsHash, simply replace the tie statement with
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
This is untested code / pseudo-Perl you'll need to polish it (strict, warnings, etc.):
# load the search and replace sreings into memeory
open($mapfh, "<", mapfile);
%maplines;
while ( $mapline = <fh> ) {
($findstr, $replstr) = split(/,/, $mapline);
%maplines{$findstr} = $replstr;
}
close $mapfh;
open($ifh, "<", inputfile);
while ($inputline = <$ifh>) { # read an input line
#input = split(/,/, $inputline); # split it into a list
if (exists $maplines{$input[3]}) { # does this line match
chomp $input[-1]; # remove the new line
push #input, $maplines{$input[3]}; # add the replace str to the end
last; # done processing this line
}
print join(',', #input); # or print or an output file
}
close($ihf)

How do I use variables to do substitution in Perl?

I have several text files, that were once tables in a database, which is now disassembled. I'm trying to reassemble them, which will be easy, once I get them into a usable form. The first file, "keys.text" is just a list of labels, inconsistently formatted. Like:
Sa 1 #
Sa 2
U 328 #*
It's always letter(s), [space], number(s), [space], and sometime symbol(s). The text files that match these keys are the same, then followed by a line of text, also separated, or delimited, by a SPACE.
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
What I'm trying to do in the code below, is match the key from "keys.text", with the same key in the .txt files, and put a tab between the key, and the text. I'm sure I'm overlooking something very basic, but the result I'm getting, looks identical to the source .txt file.
Thanks in advance for any leads or assistance!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open(IN1, "keys.text");
my $key;
# Read each line one at a time
while ($key = <IN1>) {
# For each txt file in the current directory
foreach my $file (<*.txt>) {
open(IN, $file) or die("Cannot open TXT file for reading: $!");
open(OUT, ">temp.txt") or die("Cannot open output file: $!");
# Add temp modified file into directory
my $newFilename = "modified\/keyed_" . $file;
my $line;
# Read each line one at a time
while ($line = <IN>) {
$line =~ s/"\$key"/"\$key" . "\/t"/;
print(OUT "$line");
}
rename("temp.txt", "$newFilename");
}
}
EDIT: Just to clarify, the results should retain the symbols from the keys as well, if there are any. So they'd look like:
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
The regex seems quoted rather oddly to me. Wouldn't
$line =~ s/$key/$key\t/;
work better?
Also, IIRC, <IN1> will leave the newline on the end of your $key. chomp $key to get rid of that.
And don't put parentheses around your print args, esp when you're writing to a file handle. It looks wrong, whether it is or not, and distracts people from the real problems.
if Perl is not a must, you can use this awk one liner
$ cat keys.txt
Sa 1 #
Sa 2
U 328 #*
$ cat mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
$ awk 'FNR==NR{ k[$1 SEP $2];next }($1 SEP $2 in k) {$2=$2"\t"}1 ' keys.txt mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
Using split rather than s/// makes the problem straightforward. In the code below, read_keys extracts the keys from keys.text and records them in a hash.
Then for all files named on the command line, available in the special Perl array #ARGV, we inspect each line to see whether it begins with a key. If not, we leave it alone, but otherwise insert a TAB between the key and the text.
Note that we edit the files in-place thanks to Perl's handy -i option:
-i[extension]
specifies that files processed by the <> construct are to be edited in-place. It does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print statements. The extension, if supplied, is used to modify the name of the old file to make a backup copy …
The line split " ", $_, 3 separates the current line into exactly three fields. This is necessary to protect whitespace that's likely to be present in the text portion of the line.
#! /usr/bin/perl -i.bak
use warnings;
use strict;
sub usage { "Usage: $0 text-file\n" }
sub read_keys {
my $path = "keys.text";
open my $fh, "<", $path
or die "$0: open $path: $!";
my %key;
while (<$fh>) {
my($text,$num) = split;
++$key{$text}{$num} if defined $text && defined $num;
}
wantarray ? %key : \%key;
}
die usage unless #ARGV;
my %key = read_keys;
while (<>) {
my($text,$num,$line) = split " ", $_, 3;
$_ = "$text $num\t$line" if defined $text &&
defined $num &&
$key{$text}{$num};
print;
}
Sample run:
$ ./add-tab input
$ diff -u input.bak input
--- input.bak 2010-07-20 20:47:38.688916978 -0500
+++ input 2010-07-20 21:00:21.119531937 -0500
## -1,3 +1,3 ##
-Sa 1 # Random line of text follows.
-Sa 2 This text is just as random.
-U 328 #* Continuing text...
+Sa 1 # Random line of text follows.
+Sa 2 This text is just as random.
+U 328 #* Continuing text...
Fun answers:
$line =~ s/(?<=$key)/\t/;
Where (?<=XXXX) is a zero-width positive lookbehind for XXXX. That means it matches just after XXXX without being part of the match that gets substituted.
And:
$line =~ s/$key/$key . "\t"/e;
Where the /e flag at the end means to do one eval of what's in the second half of the s/// before filling it in.
Important note: I'm not recommending either of these, they obfuscate the program. But they're interesting. :-)
How about doing two separate slurps of each file. For the first file you open the keys and create a preliminary hash. For the second file then all you need to do is add the text to the hash.
use strict;
use warnings;
my $keys_file = "path to keys.txt";
my $content_file = "path to content.txt";
my $output_file = "path to output.txt";
my %hash = ();
my $keys_regex = '^([a-zA-Z]+)\s*\(d+)\s*([^\da-zA-Z\s]+)';
open my $fh, '<', $keys_file or die "could not open $key_file";
while(<$fh>){
my $line = $_;
if ($line =~ /$keys_regex/){
my $key = $1;
my $number = $2;
my $symbol = $3;
$hash{$key}{'number'} = $number;
$hash{$key}{'symbol'} = $symbol;
}
}
close $fh;
open my $fh, '<', $content_file or die "could not open $content_file";
while(<$fh>){
my $line = $_;
if ($line =~ /^([a-zA-Z]+)/){
my $key = $1;
// strip content_file line from keys/number/symbols to leave text
line =~ s/^$key//;
line =~ s/\s*$hash{$key}{'number'}//;
line =~ s/\s*$hash{$key}{'symbol'}//;
$line =~ s/^\s+//g;
$hash{$key}{'text'} = $line;
}
}
close $fh;
open my $fh, '>', $output_file or die "could not open $output_file";
for my $key (keys %hash){
print $fh $key . " " . $hash{$key}{'number'} . " " . $hash{$key}{'symbol'} . "\t" . $hash{$key}{'text'} . "\n";
}
close $fh;
I haven't had a chance to test it yet and the solution seems a little hacky with all the regex but might give you an idea of something else you can try.
This looks like the perfect place for the map function in Perl! Read in the entire text file into an array, then apply the map function across the entire array. The only other thing you might want to do is use the quotemeta function to escape out any possible regular expressions in your keys.
Using map is very efficient. I also read the keys into an array in order to not have to keep opening and closing the keys file in my loop. It's an O^2 algorithm, but if your keys aren't that big, it shouldn't be too bad.
#! /usr/bin/env perl
use strict;
use vars;
use warnings;
open (KEYS, "keys.text")
or die "Cannot open 'keys.text' for reading\n";
my #keys = <KEYS>;
close (KEYS);
foreach my $file (glob("*.txt")) {
open (TEXT, "$file")
or die "Cannot open '$file' for reading\n";
my #textArray = <TEXT>;
close (TEXT);
foreach my $line (#keys) {
chomp $line;
map($_ =~ s/^$line/$line\t/, #textArray);
}
open (NEW_TEXT, ">$file.new") or
die qq(Can't open file "$file" for writing\n);
print TEXT join("\n", #textArray) . "\n";
close (TEXT);
}