dummy.pepmasses
YCL049C 1 511.2465 0 0 MFSK
YCL049C 2 4422.3098 0 0 YLVTASSLFVA
YCL049C 3 1131.5600 0 0 DFYQVSFVK
YCL049C 4 1911.0213 0 0 SIAPAIVNSSVIFHDVSR
YCL049C 5 774.4059 0 0 GVAMGNVK
YCL049C 6 261.1437 0 0 SR
my $dummyfile = "dummy.pepmasses"; #filename defined here
my @mzco = ();
open (IFILE, $dummyfile) or die "unable to open file $dummyfile\n ";
while (my $line = $dummyfile){
#read each line in file
chomp $line;
my $mz_value = (split/\s+/,$line)[3]; #pick column 3rd at every line
$mz_value = join "\n"; # add "\n" for data
push (@mzco,$mz_value); #add them all in one array @mzco
}
print "#mzco";
close IFILE;
There should be a better way to express this. What would it be?
I want to pick up the third column and push it into an array. Are there better methods?
I'll just go through your code and comment on it.
open (IFILE, $dummyfile) or die "unable to open file $dummyfile\n ";
You should use the 3-argument open with an explicit mode, and a lexical file handle. Also, you should not include a newline in the die message unless you want to suppress the line number. You should also include the error variable, $!.
open my $fh, "<", $dummyfile or die "Unable to open $dummyfile: $!";
while (my $line = $dummyfile){
#read each line in file
No, this just copies the file name. To read from the file handle, do this:
while (my $line = <IFILE>) {
Or <$fh> if you use a lexical file handle.
chomp $line;
my $mz_value = (split/\s+/,$line)[3]; #pick column 3rd at every line
This is actually the 4th column, since indexes start at 0.
$mz_value = join "\n"; # add "\n" for data
join does not work that way. It is join EXPR, LIST to join a list of values into a string. You want the concatenation operator .:
$mz_value = $mz_value . "\n";
Or more appropriately:
$mz_value .= "\n";
But why do it that way? It is simpler to just add the newline when you print.
print "#mzco";
You can do this:
print "$_\n" for #mzco;
Or if you are feeling daring:
use feature 'say';
say for @mzco;
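Putting those fixes together, a cleaned-up version of your script (just a sketch, keeping your file name and array name) could look like this:
use strict;
use warnings;

my $dummyfile = "dummy.pepmasses";
my @mzco;

open my $fh, "<", $dummyfile or die "Unable to open $dummyfile: $!";
while (my $line = <$fh>) {
    chomp $line;
    push @mzco, (split /\s+/, $line)[3];   # index 3 is the 4th whitespace-separated column
}
close $fh;

print "$_\n" for @mzco;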
And just to show you the power of Perl, this program can be reduced to a one-liner, using a lot of built-in features:
perl -lane ' print $F[3] ' dummy.pepmasses
-l chomps input lines and adds a newline (by default) to print
-n puts a while (<>) loop around the code: reads the input file or stdin
-a autosplits each line into @F.
The program as a file would look like this:
$\ = $/; # set output record separator to input record separator
while (<>) {
chomp;
my @F = split;
print $F[3];
}
I want to split parts of a file. Here is what the start of the file looks like (it continues in the same way):
Location Strand Length PID Gene
1..822 + 273 292571599 CDS001
906..1298 + 130 292571600 trxA
I want to split on the Location column and subtract 822-1, do the same for every row, and add them all together. So for these two rows the value would be: (822-1)+(1298-906) = 1213
How?
My code right now (I don't get any output at all in the terminal, it just continues to process forever):
use warnings;
use strict;
my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable
while(my $line = $line2){ # reading the file line by line
# TODO Use split and do the calculations
my @row = split(/\.\./, $line);
my @row2 = split(/\D/, $row[1]);
$coding += $row2[0]- $row[0];
}
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1) but the next number doesn't make any sense to me, it should be (1298-906). What I want in the end outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of the subtractions for every line, i.e. 1213. But I don't get anything, just a terminal that runs forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
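For example, the same logic dropped into your own script (a sketch only, keeping your lexical filehandle $IN and your final print) could look like:
my $coding = 0;
while (my $line = <$IN>) {
    # add (end - start) for every Location range such as 906..1298
    $coding += $2 - $1 if $line =~ /^(\d+)\.\.(\d+)/;
}
print "total amount of protein coding DNA: $coding\n";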
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $total;
while (<>) {
my ($min, $max) = /(\d+)\.\.(\d+)/;
next unless $min and $max;
$total += $max - $min;
}
say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
# your code here
}
I want to create an output file that has values from file 1 and file 2.
The line from file 1:
chr1 Cufflinks exon 708356 708487 1000 - .
gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "5"; FPKM
"3.1300591420"; frac "1.000000"; conf_lo "2.502470"; conf_hi
"3.757648"; cov "7.589085"; chr1Cufflinks exon 708356
708487 . - . gene_id "XLOC_001284"; transcript_id
"TCONS_00007667"; exon_number "7"; gene_name "LOC100288069"; oId
"CUFF.15.2"; nearest_ref "NR_033908"; class_code "j"; tss_id
"TSS2981";
The line from file 2:
CUFF.48557
chr4:160253850-160259462:160259621-160260265:160260507-160262715
The second column of this file is a unique id (uniq_id).
I want to get an output file in the following format:
transcript_id(CUFF_id) uniq_id gene_id(XLOC_ID) FPKM
My script takes the XLOC_ID and FPKM values from the first file and prints them together with the two columns from the second file.
#!/usr/bin/perl -w
use strict;
my $v_merge_gtf = shift @ARGV or die $!;
my $unique_gtf = shift @ARGV or die $!;
my %fpkm_hash;
my %xloc_hash;
open (FILE, "$v_merge_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my @array = split("\t", $line);
if ($array[2] eq 'exon') {
my $id = $array[8];
if ($id =~ /transcript_id \"(CUFF\S+)/) {
$id = $1;
$id =~ s/\"//g;
$id =~ s/;//;
}
my $fpkm = $array[8];
if ($fpkm =~ /FPKM \"(\S+)/) {
$fpkm = $1;
$fpkm =~ s/\"//g;
$fpkm =~ s/;//;
}
my $xloc = $array[17];
if ($xloc =~ /gene_id \"(XLOC\S+)/) {
$xloc = $1;
$xloc =~ s/\"//g;
$xloc =~ s/;//;
}
$fpkm_hash{$id} = $fpkm;
$xloc_hash{$id} = $xloc;
}
}
}
close FILE;
open (FILE, "$unique_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my @array = split("\t", $line);
my $id = $array[0];
my $uniq = $array[1];
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
}
}
close FILE;
I initialized the hashes outside the file loops, but I get the following warning for each of these CUFF values:
CUFF.24093
chr17:3533641-3539345:3527526-3533498:3526786-3527341:3524707-3526632
Use of uninitialized value in concatenation (.) or string at ex_1.pl line 55, <FILE> line 9343.
Use of uninitialized value in concatenation (.) or string at ex_1.pl line 55, <FILE> line 9343.
How can I fix this issue?
Thank you!
I think the warning message is because the $id key (CUFF.24093) you get at line 9343 of the second file isn't contained in the hashes you created from the first file.
Is it possible that an ID in the second file isn't contained in the first file? That seems to be the case here.
If so, and you just want to skip over this unknown ID, you could add a line to your program like:
my $id = $array[0];
my $uniq = $array[1];
next unless exists $fpkm_hash{$id}; # add this line
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
This will bypass the following print statement and go back to the top of the while loop and read in the next line and continue processing.
It depends on what action you want to take if you encounter an unknown ID.
Update: I thought I might make some observations/improvements to your code.
my $v_merge_gtf = shift @ARGV or die $!;
my $unique_gtf = shift @ARGV or die $!;
The error variable $! serves no purpose here (this is a fact I only recently discovered, even after 14 years of using Perl). $! is only set for system calls (where you are involving the operating system). The most common are open and close for files, and opendir and closedir for directories. If an error occurs in opening/closing a file or a directory, $! will contain the error message. (See in my included code how I handled this - I created a message, $usage, to be printed if the shift didn't succeed.)
Instead of using 2 hashes to store the information, I used 1 hash, %data. The advantage is that it will use less memory (because it's only storing 1 set of keys instead of 2), though you could use the 2 if you like.
I used the recommended 3-argument (filehandle, mode, filename) form for opening the files. The 2-argument approach you used is outdated and less safe (for reasons I won't go into detail here). Also, the lexical filehandles I used, my $mrg and my $unique, are the newer way to create filehandles (instead of using FILE for your 2 opens).
You can directly assign to $line in your while loop, like while (my $line = <FILE>), instead of the way you did it. In my sample program, I didn't assign to $line, but instead relied on the default variable $_. (It simplifies the 2 following statements, next unless /\S/; my @array = split /\t/;.) I didn't chomp in the first while loop because you're only parsing inside the string and aren't using anything from the end of the string. chomp is necessary in the second while loop because the second variable, my $uniq = ..., would have a newline at its end if it wasn't removed by chomp.
I didn't know what you meant by this statement: if ($line =~ /[a-z]/). I am assuming you wanted to check for empty lines and only process lines with non-space data. That's why I wrote next unless /\S/; instead. (It says to skip the following statements, go to the top of the while loop, and read the next record.)
Your first while loop worked because you had no errors in your input file. If there had been errors, the way you wrote the code could have been a problem.
The statement my $id = $array[8]; gives $id a value that would have been wrongly used if the following if statement had been false. (The same thing goes for the 2 other variables you want to capture, $fpkm and $xloc.) You can see in my code example how I handled this.
In my code, I died if the match didn't succeed. You might not want to die, but instead warn and skip to the next line of data. It depends on how you would want to handle a failed match.
And in this line, $array[8] =~ /gene_id "(CUFF\S+)";/, note that I put the "; following the captured data, so there is no need to remove it from the captured data (as you did in your substitutions).
Well, I know this is a long comment on your code, but I hope you get some good ideas about why I recommended the changes given.
or die "Could not find ID in $v_merge_gtf (line# $.)";
$. is the line number of the file being read.
#!/usr/bin/perl
use warnings;
use strict;
my $usage = "USAGE: perl $0 merge_gtf_file unique_gtf_file\n";
my $v_merge_gtf = shift @ARGV or die $usage;
my $unique_gtf = shift @ARGV or die $usage;
my %data;
open my $mrg, '<', $v_merge_gtf or die $!;
while (<$mrg>) {
next unless /\S/;
my #array = split /\t/;
if ($array[2] eq 'exon') {
$array[8] =~ /gene_id "(CUFF\S+)";/
or die "Could not find ID in $v_merge_gtf (line# $.)";
my $id = $1;
$array[8] =~ /FPKM "(\S+)";/
or die "Could not find FPKM in $v_merge_gtf (line# $.)";
my $fpkm = $1;
$array[17] =~ /gene_id "(XLOC\S+)";/
or die "Could not find XLOC in $v_merge_gtf (line# $.)";
my $xloc = $1;
$data{$id}{fpkm} = $fpkm;
$data{$id}{xloc} = $xloc;
}
}
close $mrg or die $!;
open my $unique, '<', $unique_gtf or die $!;
while (<$unique>) {
next unless /\S/;
chomp;
my ($id, $uniq) = split /\t/;
print join("\t", $id, $uniq, $data{$id}{fpkm}, $data{$id}{xloc}), "\n";
}
close $unique or die $!;
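Assuming you save this as, say, merge_gtf.pl and your two input files are merged.gtf and unique.gtf (all of these names are just placeholders), you would run it with both files as arguments and redirect the output:
perl merge_gtf.pl merged.gtf unique.gtf > output.txt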
I am stuck combining BLAST commands into a Perl script. The problem is that the script pauses when PART II begins.
PART I is used to crop the FASTA sequences.
PART II is used to run BLAST on the files generated by PART I.
Both parts run well individually, but I hit the "pause" problem when they are combined.
I guess it is because the $ARGV[1] and $ARGV[3] files generated by PART I cannot be used in PART II. I don't know how to fix it, though I have tried a lot.
Thanks!
#! /usr/bin/perl -w
use strict;
#### PART I
die "usage:4files fasta1 out1 fasta2 out2\n" unless #ARGV==4;
open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
open OUT,">$ARGV[1]" || die "no out\n";
open (S2, "$ARGV[2]") || die "cannot open FASTA file to read: $!";
open OUT2,">$ARGV[3]" || die "no out2\n";
my %s;# a hash of arrays, to hold each line of sequence
my %seq; #a hash to hold the AA sequences.
my $key;
print "how long is the N-terminal(give number,e.g. 30. whole length input \"0\") \n";
chomp(my $nl=<STDIN>);
##delete "\n" for seq.
local $/ = ">";
<S>;
while (<S>){ #Read the FASTA file.
chomp;
my @line=split/\n/;
print OUT ">",$line[0],"\n";
splice @line,0,1;
#print OUT join ("",@line),"\n";
#@line = join("",@line);
#print @line,"\n";
if ($nl == 0){ #whole length
my $seq=join("",@line);
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
#push @{$s{$key}},@amac;
print OUT @amac,"\n";
}
else { # extract inital aa by number ##Guanhua
my $seq=join("",@line);
#print $seq,"\n";
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
splice @amac,$nl; ##delete from the N to end
#print @amac,"\n";
#push (@{$s{$key}}, @amac);
print OUT @amac,"\n";
}
}
<S2>;
while (<S2>){ #Read the FASTA file.
chomp;
my @line=split/\n/;
print OUT2 ">",$line[0],"\n";
splice @line,0,1;
#print OUT join ("",@line),"\n";
#@line = join("",@line);
#print @line,"\n";
if ($nl == 0){ #whole length
my $seq=join("",@line);
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
#push @{$s{$key}},@amac;
print OUT2 @amac,"\n";
}
else { # extract inital aa by number ##Guanhua
my $seq=join("",@line);
#print $seq,"\n";
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
splice @amac,$nl; ##delete from the N to end
#print @amac,"\n";
#push (@{$s{$key}}, @amac);
print OUT2 @amac,"\n";
}
}
##### PART II
print "nucl or prot?\n";
chomp(my $tp = <STDIN>);
system ("makeblastdb -in $ARGV[1] -dbtype prot");
system ("makeblastdb -in $ARGV[3] -dbtype $tp");
print "blast type? (blastp,blastn)\n";
chomp(my $cmd = <STDIN>);
system ("blastp -query $ARGV[1] -db $ARGV[3] -outfmt 6 -evalue 1e-3 -out 12.out ");
system ("$cmd -db $ARGV[1] -query $ARGV[3] -outfmt 6 -evalue 1e-3 -out 21.out ");
You changed the way perl reads from 'STDIN' when you set '$/' in this line:
local $/ = ">";
Because $/ is now ">", the later chomp(my $tp = <STDIN>) reads in PART II keep waiting for a ">" character instead of returning at the newline, which is why the script appears to pause. The easiest way to fix this is to add an opening brace right before that line and a closing brace just before the '##### PART II' comment, so local restores $/ at the end of the block:
{
local $/ = ">";
...
...
}
##### PART II
(I think theoretically, you could put a ">" at the end of the text you input, but that seems strange, so I wouldn't do it)
That will fix your problem. But something that should be addressed is some of the style choices you made. The two big chunks of code in the middle are both identical as far as I can tell and should probably be put into a subroutine and then called twice. This will eliminate duplication and is less error prone.
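For example, a rough sketch of that refactoring (the sub name crop_fasta and the lexical filehandle names are mine, not from your script, and this assumes the files are opened with lexical handles):
sub crop_fasta {
    my ($in, $out, $nl) = @_;
    local $/ = ">";                    # record separator is restored when the sub returns
    <$in>;                             # skip the empty record before the first ">"
    while (<$in>) {
        chomp;
        my @line = split /\n/;
        my $header = shift @line;
        print $out ">", $header, "\n";
        my @amac = split //, join("", @line);
        shift @amac;                   # delete the first residue, as in the original
        splice @amac, $nl if $nl > 0;  # keep only the first $nl residues
        print $out @amac, "\n";
    }
}

crop_fasta($fasta1_in, $out1, $nl);
crop_fasta($fasta2_in, $out2, $nl);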
You should also use the three argument open call to open files.
I have a Perl question. I have a file; each line of it contains a different number of As, Ts, Gs and Cs.
The file looks like this:
ATCGCTGASTGATGCTG
GCCTAGCCCTTAGC
GTTCCATGCCCATAGCCAAATAAA
I would like to add a line number for each line, then insert a \n every 6 characters, and then on each of the new rows created put an empty space every 3 characters.
An example of the output should be:
Line NO 1
ATC GCT
GAS TGA
TGC TG
Line NO 2
GCC TAG
CCC TTA
GC
I have come up with the code below:
my $count = 0;
my $line;
my $row;
my $split;
open(F, "Data.txt") or die "Can't read file: $!";
open (FH, " > UpDatedData.txt") or die "Can't write new file: $!";
while (my $line = <F>) {
$count ++ ;
$row = join ("\n", ( $line =~ /.{1,6}/gs));
$split = join ("\t", ( $row =~ /.{3}/gs ));
print FH "Line NO\t$count\n$split\n";
}
close F;
close FH;
However, it gives the following output:
Line NO 1
ATC GCT
GA STG A
T GCT G
Line NO 2
GCC TAG
CC CTT A
G C
This must have something to do with the \n being counted as a character in this line of code:
$split = join ("\t", ( $row =~ /.{3}/gs ));
Anyone got any idea how to get around this problem?
Any help would be greatly appreciated.
Thanks in advance
Sinead
This should solve your problem:
use strict;
use warnings;
while (<DATA>) {
s/(.{3})(.{0,3})?/$1 $2 /g;
s/(.{7}) /$1\n/g;
printf "Line NO %d\n%s\n", $., $_;
}
__DATA__
ATCGCTGASTGATGCTG
GCCTAGCCCTTAGC
GTTCCATGCCCATAGCCAAATAAA
This is a one-liner:
perl -plwe 's/(.{3})(.{0,3})/$1 $2\n/g' data.txt
The regex looks for 3 characters (does not match newline), followed by 0-3 characters and captures both of those, then inserts a space between them and newline after.
To keep track of the line numbers, you can add
s/^/Line NO $.\n/;
Which will enumerate based on input line number. If you prefer, you can keep a simple counter, such as ++$i.
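For instance (just a sketch; the counter variable $i is mine), the counter version of the one-liner could be:
perl -plwe '$i++; s/(.{3})(.{0,3})/$1 $2\n/g; s/^/Line NO $i\n/;' data.txt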
-l option will handle newlines for you.
You can also do it in two stages, like so:
perl -plwe's/.{6}\K/\n/g; s/^.{3}\K/ /gm;'
Using the \K (keep) escape sequence here to keep the matched part of the string, and then simply inserting a newline after 6 characters, and then a space 3 characters after "line beginnings", which with the /m modifier also includes newlines.
So, in short:
perl -plwe 's/.{6}\K/\n/g; s/^.{3}\K/ /gm; s/^/Line NO $.\n/;' data.txt
perl -plwe 's/(.{3})(.{0,3})/$1 $2\n/g; s/^/Line NO $.\n/;' data.txt
Another solution. Note that it uses lexical filehandles and the three-argument form of open.
#!/usr/bin/perl
use warnings;
use strict;
open my $IN, '<', 'Data.txt' or die "Can't read file: $!";
open my $OUT, '>', 'UpDatedData.txt' or die "Can't write new file: $!";
my $count = 0;
while (my $line = <$IN>) {
chomp $line;
$line =~ s/(...)(...)/$1 $2\n/g; # Create pairs of triples
$line =~ s/(\S\S\S)(\S{1,2})$/$1 $2\n/; # A triple plus something at the end.
$line .= "\n" if $line !~ /\n$/; # A triple or less at the end.
$count++;
print $OUT "Line NO\t$count\n$line\n";
}
close $OUT;
I have several text files, that were once tables in a database, which is now disassembled. I'm trying to reassemble them, which will be easy, once I get them into a usable form. The first file, "keys.text" is just a list of labels, inconsistently formatted. Like:
Sa 1 #
Sa 2
U 328 #*
It's always letter(s), [space], number(s), [space], and sometimes symbol(s). The text files contain these same keys, each then followed by a line of text, also separated, or delimited, by a SPACE.
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
What I'm trying to do in the code below is match the key from "keys.text" with the same key in the .txt files, and put a tab between the key and the text. I'm sure I'm overlooking something very basic, but the result I'm getting looks identical to the source .txt file.
Thanks in advance for any leads or assistance!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open(IN1, "keys.text");
my $key;
# Read each line one at a time
while ($key = <IN1>) {
# For each txt file in the current directory
foreach my $file (<*.txt>) {
open(IN, $file) or die("Cannot open TXT file for reading: $!");
open(OUT, ">temp.txt") or die("Cannot open output file: $!");
# Add temp modified file into directory
my $newFilename = "modified\/keyed_" . $file;
my $line;
# Read each line one at a time
while ($line = <IN>) {
$line =~ s/"\$key"/"\$key" . "\/t"/;
print(OUT "$line");
}
rename("temp.txt", "$newFilename");
}
}
EDIT: Just to clarify, the results should retain the symbols from the keys as well, if there are any. So they'd look like:
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
The regex seems quoted rather oddly to me. Wouldn't
$line =~ s/$key/$key\t/;
work better?
Also, IIRC, <IN1> will leave the newline on the end of your $key. chomp $key to get rid of that.
And don't put parentheses around your print args, esp when you're writing to a file handle. It looks wrong, whether it is or not, and distracts people from the real problems.
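Putting those two suggestions together, the read loops might look something like this (just a sketch, keeping your filehandle names; I've also added \Q...\E around the key so characters like * in the keys are treated literally):
while (my $key = <IN1>) {
    chomp $key;                          # strip the newline so the key can actually match
    # ... open the .txt file and the output file as you already do ...
    while (my $line = <IN>) {
        $line =~ s/\Q$key\E/$key\t/;     # insert a tab right after the key
        print OUT $line;
    }
}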
If Perl is not a must, you can use this awk one-liner:
$ cat keys.txt
Sa 1 #
Sa 2
U 328 #*
$ cat mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
$ awk 'FNR==NR{ k[$1 SEP $2];next }($1 SEP $2 in k) {$2=$2"\t"}1 ' keys.txt mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
Using split rather than s/// makes the problem straightforward. In the code below, read_keys extracts the keys from keys.text and records them in a hash.
Then for all files named on the command line, available in the special Perl array @ARGV, we inspect each line to see whether it begins with a key. If not, we leave it alone, but otherwise insert a TAB between the key and the text.
Note that we edit the files in-place thanks to Perl's handy -i option:
-i[extension]
specifies that files processed by the <> construct are to be edited in-place. It does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print statements. The extension, if supplied, is used to modify the name of the old file to make a backup copy …
The line split " ", $_, 3 separates the current line into exactly three fields. This is necessary to protect whitespace that's likely to be present in the text portion of the line.
#! /usr/bin/perl -i.bak
use warnings;
use strict;
sub usage { "Usage: $0 text-file\n" }
sub read_keys {
my $path = "keys.text";
open my $fh, "<", $path
or die "$0: open $path: $!";
my %key;
while (<$fh>) {
my($text,$num) = split;
++$key{$text}{$num} if defined $text && defined $num;
}
wantarray ? %key : \%key;
}
die usage unless @ARGV;
my %key = read_keys;
while (<>) {
my($text,$num,$line) = split " ", $_, 3;
$_ = "$text $num\t$line" if defined $text &&
defined $num &&
$key{$text}{$num};
print;
}
Sample run:
$ ./add-tab input
$ diff -u input.bak input
--- input.bak 2010-07-20 20:47:38.688916978 -0500
+++ input 2010-07-20 21:00:21.119531937 -0500
@@ -1,3 +1,3 @@
-Sa 1 # Random line of text follows.
-Sa 2 This text is just as random.
-U 328 #* Continuing text...
+Sa 1 # Random line of text follows.
+Sa 2 This text is just as random.
+U 328 #* Continuing text...
Fun answers:
$line =~ s/(?<=$key)/\t/;
Where (?<=XXXX) is a zero-width positive lookbehind for XXXX. That means it matches just after XXXX without being part of the match that gets substituted.
And:
$line =~ s/$key/$key . "\t"/e;
Where the /e flag at the end means to do one eval of what's in the second half of the s/// before filling it in.
Important note: I'm not recommending either of these, they obfuscate the program. But they're interesting. :-)
How about doing two separate slurps, one of each file? For the first file you open the keys and create a preliminary hash. For the second file, all you need to do then is add the text to the hash.
use strict;
use warnings;
my $keys_file = "path to keys.txt";
my $content_file = "path to content.txt";
my $output_file = "path to output.txt";
my %hash = ();
my $keys_regex = '^([a-zA-Z]+)\s*(\d+)\s*([^\da-zA-Z\s]+)';
open my $fh, '<', $keys_file or die "could not open $keys_file";
while(<$fh>){
my $line = $_;
if ($line =~ /$keys_regex/){
my $key = $1;
my $number = $2;
my $symbol = $3;
$hash{$key}{'number'} = $number;
$hash{$key}{'symbol'} = $symbol;
}
}
close $fh;
open my $fh, '<', $content_file or die "could not open $content_file";
while(<$fh>){
my $line = $_;
if ($line =~ /^([a-zA-Z]+)/){
my $key = $1;
# strip the content_file line of the key/number/symbol to leave just the text
$line =~ s/^$key//;
$line =~ s/\s*$hash{$key}{'number'}//;
$line =~ s/\s*$hash{$key}{'symbol'}//;
$line =~ s/^\s+//g;
$hash{$key}{'text'} = $line;
}
}
close $fh;
open my $fh, '>', $output_file or die "could not open $output_file";
for my $key (keys %hash){
print $fh $key . " " . $hash{$key}{'number'} . " " . $hash{$key}{'symbol'} . "\t" . $hash{$key}{'text'} . "\n";
}
close $fh;
I haven't had a chance to test it yet and the solution seems a little hacky with all the regex but might give you an idea of something else you can try.
This looks like the perfect place for the map function in Perl! Read in the entire text file into an array, then apply the map function across the entire array. The only other thing you might want to do is use the quotemeta function to escape out any possible regular expressions in your keys.
Using map is very efficient. I also read the keys into an array in order to not have to keep opening and closing the keys file in my loop. It's an O(n²) algorithm, but if your keys aren't that big, it shouldn't be too bad.
#! /usr/bin/env perl
use strict;
use vars;
use warnings;
open (KEYS, "keys.text")
or die "Cannot open 'keys.text' for reading\n";
my @keys = <KEYS>;
close (KEYS);
foreach my $file (glob("*.txt")) {
open (TEXT, "$file")
or die "Cannot open '$file' for reading\n";
my @textArray = <TEXT>;
close (TEXT);
foreach my $line (@keys) {
chomp $line;
map($_ =~ s/^\Q$line\E/$line\t/, @textArray);  # \Q escapes regex metacharacters in the keys
}
open (NEW_TEXT, ">$file.new") or
die qq(Can't open file "$file.new" for writing\n);
print NEW_TEXT @textArray;   # lines already end in "\n", so print them as-is
close (NEW_TEXT);
}