Opening a file inside a subroutine for read/write in Perl

I am trying to open a file inside a subroutine in order to substitute some lines in that file. Since that was not working, I tried something simpler for debugging purposes: printing a line instead of substituting it. The subroutine code follows.
sub replace {
while (<INPUT_FILE>){
my $cell = $_[0];
our $rpl;
if ($_=~ /^TASK\|VALUE = (.*)/ ) {
my $task = $1;
chomp $task;
$rpl = $cell . '_' . $task . '_bunch_rpl';
print "000: $rpl\n";
}
elsif ($_=~ /^(.*)\|VALUE = (.*)/ ) {
my $line = $_;
chomp $line;
my $ip_var = $1;
my $ip_val = $2;
chomp $ip_var;
chomp $ip_val;
my $look= $ip_var."|VALUE";
open(REPLAY_FILE, "+<$rpl") || die "\ncannot open $rpl\n";
while (my $rpl_sub = <REPLAY_FILE>) {
if ($rpl_sub =~ /^$line/) {
print "\n 111: $ip_val";
}
}
close REPLAY_FILE;
}
elsif ($_=~ /^\s*$/) {
print "\n";
return ;
}
}
}
The code prints the following as of now.
000: lfr_task62_bunch_rpl
111: 2.0.9.0
111: INLINE
111: POWER
000: aaa_task14_bunch_rpl
Expected output is:
000: lfr_task62_bunch_rpl
111: 2.0.9.0
111: INLINE
111: POWER
000: aaa_task14_bunch_rpl
111: 0.45
111: NO
The input sample is:
TASK_CELL_NAME|VALUE = lfr
TASK|VALUE = task62
TASK_VERSION|VALUE = 2.0.9.0
CHIP_PKG_TYPE|VALUE = INLINE
JUNK_LINE = JUNK
JUNK_LINE = JUNK
FULL_ESD|VALUE = POWER
TASK_CELL_NAME|VALUE = aaa
TASK|VALUE = task14
CUSTOM_CELL_DENSITY|VALUE = 0.45
CUSTOM_CELL_SS|VALUE = NO
Can someone tell me what mistake I am making here?
UPDATE: Main code below
my @cell_names;
open(INPUT_FILE, "<$ip_file") || die "\n!!!ERROR OPENING INPUT FILE. EXITING SCRIPT!!!\n";
while (<INPUT_FILE>) {
if ($_=~ /(.*) =\n/ ) {
$mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
exit;
}
elsif ($_=~ /(.*) =\s+\n/ ) {
$mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
exit;
}
elsif ($_=~ /(.*) = \s+(.*)/ ) {
$mw -> messageBox(-message=> "\nFormat not correct on line $. of input file. Exiting script\n");
exit;
}
elsif ($_=~ /^TASK_CELL_NAME\|VALUE = (.*)/ ) {
my $cell_name = $1;
chomp $cell_name;
unless(grep( /^$cell_name $/, @cell_names )) {
push @cell_names, "$cell_name ";
#$count++;
#print "\nCELL NAME: $cell_name\n";
replace($cell_name);
}
}
}
close INPUT_FILE;
Update: lfr_task62_bunch_rpl before running code:
# Select fund
FUND|VALUE = mmi
# Select bank
BANK|VALUE = citi
# Select cell name
TASK_CELL_NAME|VALUE = lfr
# Select task
TASK|VALUE = task62
# Select task version
TASK_VERSION|VALUE = 1.0.9.0
# Select fund type
FULL_ESD|VALUE = MUTUAL
# Select customer premium
CUSTOM_CELL_SS|VALUE = YES
# Select customer brand density
CUSTOM_CELL_DENSITY|VALUE = 0.76
# Select card chip
CHIP_PKG_TYPE|VALUE|VALUE = OUTLINE
Expected lfr_task62_bunch_rpl after running code:
# Select fund
FUND|VALUE = mmi
# Select bank
BANK|VALUE = citi
# Select cell name
TASK_CELL_NAME|VALUE = lfr
# Select task
TASK|VALUE = task62
# Select task version
TASK_VERSION|VALUE = 2.0.9.0
# Select fund type
FULL_ESD|VALUE = POWER
# Select customer premium
CUSTOM_CELL_SS|VALUE = YES
# Select customer brand density
CUSTOM_CELL_DENSITY|VALUE = 0.76
# Select card chip
CHIP_PKG_TYPE|VALUE|VALUE = INLINE

It's not really clear what this code is supposed to do. But I can immediately see a few problems with the logic. Let's step through a few iterations of the loop, using your sample data file.
The first time, the line of data read in is:
TASK_CELL_NAME|VALUE = lfr
So that matches on your second regex match. You set a few variables and then (because $ip_var is equal to "TASK_CELL_NAME") you skip to the else clause and close a filehandle that isn't open.
Next time round, we read:
TASK|VALUE = task62
That matches your first regex match. The variable $rpl_file is set to "XXX_lfr_bunch_rpl" (where 'XXX' is the parameter passed to the subroutine - obviously, I don't know what that is). You print a "000" line with that value and open the file with that name in r/w mode.
Third time round, we get this data:
TASK_VERSION|VALUE = 2.0.9.0
This matches your second regex, and because $ip_var isn't equal to "TASK_CELL_NAME" we go into the if clause. This reads from your open filehandle and prints a "111" line. But this generates a warning if you have use warnings switched on, as the line includes the value of $rpl_file, which is currently undefined. It was set the last time around the loop, but because the variable is declared inside the loop, it has now lost its value. We then close the filehandle.
The fourth iteration will be the last one that's really interesting. We get this data:
CHIP_PKG_TYPE|VALUE = INLINE
This also matches the second regex, so we do much the same as in the third iteration. The difference here is that when we try to read from the filehandle, we get a warning because that filehandle is closed. Oh, and then we close it again for good measure :-)
As I said at the start, I can't really work out what we're trying to do here. But I can see that the logic is very strange. You really need to go back to the drawing board and think through your logic again.
Update:
With the updated version of your code, I'm still seeing problems.
On the first iteration, the data is:
TASK_CELL_NAME|VALUE = lfr
So this matches your second regex. That goes into the piece of code that opens the other file and tries to read from it. But it expects to find the filename in $rpl and that variable hasn't been given a value yet. So the open() fails and the program dies.
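One way to avoid depending on which filehandle happens to be open, and on whether $rpl has been set yet, is to keep the substitution itself in a small function that works on lines already in memory. The following is only a sketch under assumptions: update_values is a hypothetical helper, and it assumes every line to change looks like KEY|VALUE = something, as in the replay file shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns a copy of @$lines
# in which the value of each "KEY|VALUE = old" line is replaced whenever KEY
# is present in %$new_values. All other lines are passed through unchanged.
sub update_values {
    my ($lines, $new_values) = @_;
    my @out;
    for my $line (@$lines) {
        # $2 is the key before the first "|", $1 is everything up to " = "
        if ($line =~ /^(([^|\s]+)\|VALUE.*? = )/ && exists $new_values->{$2}) {
            push @out, $1 . $new_values->{$2} . "\n";
        }
        else {
            push @out, $line;    # comments and unknown keys stay as they are
        }
    }
    return @out;
}

my @replay = (
    "# Select task version\n",
    "TASK_VERSION|VALUE = 1.0.9.0\n",
    "FULL_ESD|VALUE = MUTUAL\n",
    "CUSTOM_CELL_SS|VALUE = YES\n",
);
my %new = (TASK_VERSION => '2.0.9.0', FULL_ESD => 'POWER');

print update_values(\@replay, \%new);
```

With this split, the caller would read the replay file into an array, call the helper once per cell with the values collected for that cell, and then write the returned lines back to the file.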

Related

How do I rewrite a text file on the same columns in perl?

I have an ncl script which I modify from inside a perl script. Each time I run the perl script, it edits the ncl script and then calls it.
What I have noticed is that each time I run the perl script, the lines in the ncl script are progressively shifted to the right. I wonder how far to the right they will go, and whether the ncl script will still work if they move very far to the right.
I am wondering if it is possible to specify the column from which to start writing. I have no idea if there is a solution to this, apart from manually moving the lines back.
Below are the changes I make to the script.
open($NCLSCRPT, " < /home/zmumba/DA/TMINTMAX/Scripts/10_PlotT2mMinMax.ncl") || die "open $NCLSCRPT: failed $! ($^E)";
@scrptlines=<$NCLSCRPT>;
foreach $scrptlines (@scrptlines) {
if ($scrptlines =~ /(^|\s+)a = addfile/) {
$scrptlines =~ s/\w+\.nc/Tmin.nc/g;
}
if ($scrptlines =~ /wks = gsn_open_wks/) {
$scrptlines =~ s/\w+_\d{2}/Tmin_$day/g;
}
$scrptlines =~ s/\w+ Temperature/Minimum Temperature/g;
if ($scrptlines =~ /Valid on/) {
$scrptlines =~ s/$valdate/${day}\/${month}\/${year}/g;
}
if ($scrptlines =~ /From 00z on/) {
$scrptlines =~ s/$wrffcr/T\+00H/g;
$scrptlines =~ s/$valdate/${day}\/${month}\/${year}/g;
}
}
close $NCLSCRPT;
open($NCLSCRPT, " > /home/zmumba/DA/TMINTMAX/Scripts/10_PlotT2mMinMax.ncl") || die "open $NCLSCRPT: failed $! ($^E)";
print $NCLSCRPT "@scrptlines\n";
close $NCLSCRPT;
For completeness, I am adding part of the text file I am processing, as requested by @Håkon Hægland.
Otherwise the problem has been resolved.
begin
;Open input file.
;************************
DATADir = "./"
a = addfile(DATADir+"Tmax.nc","r")
T = wrf_user_getvar(a,"T2",0)
T = T-273.15
wrf_smooth_2d( T, 3 ) ; smooth T
;printVarSummary(T)
;exit
lat2d = a->XLAT(:,:)
lon2d = a->XLONG(:,:)
lat = lat2d(:,0) ; create classic 1D coordinate arrays
lon = lon2d(0,:)
lat@units= "degrees_north"
lon@units= "degrees_east"
lat!0 = "lat"
lon!0 = "lon"
lat&lat = lat

You are adding spaces with your final print:
print $NCLSCRPT "@scrptlines\n";
When you interpolate an array into a string, its elements are joined with the value of the $" variable, which defaults to a single space. So each time you read the lines from the file and write them back through this interpolation, every line after the first gains one more leading space.
It should instead be
print $NCLSCRPT @scrptlines;
The lines still carry their original newlines, so if you never edit the line endings, you don't need to worry about them.
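The effect is easy to reproduce without any files. This small sketch uses a made-up three-line array and compares interpolating it into a string with joining it on the empty string, which is what printing the array as a list amounts to while $, is left at its default.

```perl
use strict;
use warnings;

my @lines = ("a = 1\n", "b = 2\n", "c = 3\n");

# Interpolation joins the elements with $" (a single space by default),
# so every line after the first starts with an extra space.
my $interpolated = "@lines";

# Joining on the empty string puts nothing between the elements.
my $listed = join '', @lines;

print "interpolated:\n$interpolated";
print "listed:\n$listed";
```

Running the interpolated form through a read/write cycle repeatedly adds one more leading space each time, which matches the drift to the right described in the question.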

Add new hash keys and then print in a new file

Previously, I posted a question about using a regex to match specific sequence identifiers (IDs).
Now I'm looking for recommendations on how to print the data that I am looking for.
If you want to see the complete file, here's a GitHub link.
This script takes two files to work. The first file is something like this (this is only a part of the file):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
The sixth column of this file holds a score, and I am interested in rows where that score is >= 0.5. When this happens, my script takes the first column (an ID, used to match against the second file) and the fourth column (the position of a letter in the second file).
Here my second file (this is only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
Like I said previously, the script takes the ID in the first file to match with the ID in the second file when these are the same and then searches for the position (fourth column) in the contents of the data.
Here an example, in file one the fourth row is a positive value (>=0.5) and the position in the fourth column is 23.
Then the script searches for position 23 in the data contents of the second file, here position 23 is a letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
Once the script has found that letter, it should take the 2 letters to the left and the 2 letters to the right of the position of interest:
IVTLP
In the previous post, thanks to the help of some people on Stack Overflow, I could solve a problem caused by a difference between the IDs in each file (AGY29650_2_NA in file one versus AGY29650.2 in file two).
Now I am looking for help to obtain the output that I need to complete the script.
The script is incomplete because I couldn't find a way to print the output of interest: the 5 letters from the second file, that is, the letter at the position given in file one plus 2 letters to the right and 2 to the left.
I have thousands of files like files one and two, so I need some help to complete the script with any idea that you can recommend.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my @array_residues = ();
my $seqio_obj = Bio::SeqIO->new(-file => $in,
-format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
my $dd = $seq_obj->id;
my $ss = $seq_obj->seq;
###my $ee = $seq_obj->desc;
$fastadata{$dd} = "$ss";
}

my $thres = 0.5; ### Selection of values in column N°5 with the following condition: >=0.5

# Open file
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
$one =~ s/\n//g;
$one =~ s/\r//g;
my @cols = split(/\s+/, $one); ### split columns
next unless (scalar (@cols) == 7); ### the line must have 7 columns to add to the array
my $val = $cols[5];

if ($val >= 0.5) {
my $position = $cols[3];
my $id_list = $cols[0];
$id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
if (exists($fastadata{$id_list})) {
my $new_seq = $fastadata{$id_list};
my $subresidues = substr($new_seq, $position -3, 6);
}
}
}
close F;
I'm thinking of adding a push to collect the new data and then printing it to a new file.
My expected output is the letter at the position of a positive value (>=0.5), in this case T (position 23), together with the 2 letters to the right and the 2 letters to the left.
In this case, with the example data on GitHub (link above), the expected output is:
IVTLP
Any recommendation or help is welcome.
Thanks!

The main problem seems to be that each line has 8 columns, not 7 as assumed in the script. Another small issue is that the extracted substring should have 5 characters, not 6 as assumed by the script. Here is a modified version of the loop that works for me:
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
chomp $one;
my @cols = split(/\s+/, $one); ### split columns
next unless (scalar @cols) == 8; ### the line must have 8 columns to add to the array
my $val = $cols[5];
if ($val >= 0.5) {
my $position = $cols[3];
my $id_list = $cols[0];
$id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
if (exists($fastadata{$id_list})) {
my $new_seq = $fastadata{$id_list};
my $subresidues = substr($new_seq, $position -3, 5);
print $subresidues, "\n";
}
}
}
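The offset arithmetic in the substr call (position minus 3, length 5) can be isolated and checked on its own. The sketch below assumes positions are 1-based, as in the question's example; residue_window is a hypothetical helper name, and the clamping at the start of the sequence is an addition for positions closer than two letters to the left edge.

```perl
use strict;
use warnings;

# Hypothetical helper: return the residue at a 1-based $position together
# with $flank residues on each side.
sub residue_window {
    my ($seq, $position, $flank) = @_;
    my $start  = $position - 1 - $flank;    # 1-based position to 0-based offset
    my $length = 2 * $flank + 1;
    if ($start < 0) {                       # window would begin before the sequence
        $length += $start;                  # shrink it by the missing residues
        $start = 0;
    }
    return substr($seq, $start, $length);   # substr stops at the end of $seq
}

my $seq = 'MTYSVFPLMCLLTFIGANAKIVTLPGNDA';
print residue_window($seq, 23, 2), "\n";    # prints IVTLP
```

With a flank of 2 and a position of 23, the start offset is 20, which is the same as $position - 3 in the loop above.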

Regular expression to print a string from a command output

I have written a function that uses regex and prints the required string from a command output.
The script works as expected, but it does not support dynamic output. Currently, I use a regex for "icmp" and "OK" and print the values. However, the type, destination and return code could change, and there is a high chance that the command doesn't return any output at all. How do I handle such scenarios?
sub check_summary{
my ($self) = @_;
my $type = 0;
my $return_type = 0;
my $ipsla = $self->{'ssh_obj'}->exec('show ip sla');
foreach my $line( $ipsla) {
if ( $line =~ m/(icmp)/ ) {
$type = $1;
}
if ( $line =~ m/(OK)/ ) {
$return_type = $1;
}
}
INFO ($type,$return_type);
}
Command output:
PSLAs Latest Operation Summary
Codes: * active, ^ inactive, ~ pending
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago

Updated after some clarifications: we need only the last line
As is often the case, you don't need a regex to parse the output as shown. You have space-separated fields and can just split the line and pick the elements you need.
We are told that the line of interest is the last line of the command output. Then we don't need the loop but can take the last element of the array of lines. It is still unclear how $ipsla contains the output: as a multi-line string or perhaps as an arrayref. Since it is the output of a command I'll treat it as a multi-line string, akin to what qx returns. Then, instead of the foreach loop, use
my @lines = split '\n', $ipsla; # if $ipsla is a multi-line string
# my @lines = @$ipsla; # if $ipsla is an arrayref
pop @lines while $lines[-1] !~ /\S/; # remove possible empty lines at end
my ($type, $return_type) = (split ' ', $lines[-1])[1,4];
Here are some comments on the code. Let me know if more is needed.
We can see in the shown output that the fields up to what we need have no spaces. So we can split the last line on white space, by split ' ', $lines[-1], and take the 2nd and 5th element (indices 1 and 4), by ( ... )[1,4]. These are our two needed values and we assign them.
Just in case the output ends with empty lines we first remove them, by doing pop @lines as long as the last line has no non-space characters, while $lines[-1] !~ /\S/. That is the same as
while ( $lines[-1] !~ /\S/ ) { pop @lines }
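Since the question also asks what to do when the command returns nothing, the same split-based approach can be wrapped so that empty output is handled explicitly. This is a sketch that assumes the output is one multi-line string; last_row_fields is a hypothetical name, and it drops all blank lines rather than only the trailing ones.

```perl
use strict;
use warnings;

# Hypothetical helper: return (type, return code) from the last non-blank
# line of the command output, or an empty list when there is no output.
sub last_row_fields {
    my ($output) = @_;
    my @lines = grep { /\S/ } split /\n/, $output;   # keep non-blank lines only
    return unless @lines;                            # the command printed nothing
    return (split ' ', $lines[-1])[1, 4];            # 2nd and 5th fields
}

my $sample = <<'END';
ID Type Destination Stats Return Last
(ms) Code Run
-----------------------------------------------------------------------
*1 icmp 192.168.25.14 RTT=1 OK 1 second ago

END

my ($type, $return_type) = last_row_fields($sample);
print defined $type ? "type: $type, return code: $return_type\n" : "no data\n";
```

Note that this still trusts the last line to be a data row. If the command can print the header without any rows, the caller would need an extra check, for example that the first field starts with one of the status codes.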
Original version, edited for clarifications. It is also a valid way to do what is needed.
I assume that the data starts after the line with only dashes. Set a flag once that line is reached, and process the line(s) if the flag is set. Given the rest of your code, the loop becomes
my $data_start;
foreach (@lines)
{
if (not $data_start) {
$data_start = 1 if /^\s* -+ \s*$/x; # only dashes and optional spaces
}
else {
my ($type, $return_type) = (split)[1,4];
print "type: $type, return code: $return_type\n";
}
}
This is a sketch until clarifications come. It also assumes that there are more lines than one.

I'm not sure of all possibilities of output from that command so my regular expression may need tweaking.
I assume the goal is to get the values of all columns in variables. I opted to store values in a hash using the column names as the hash keys. I printed the results for debugging / demonstration purposes.
use strict;
use warnings;
sub check_summary {
my ($self) = @_;
my %results = map { ($_,undef) } qw(Code ID Type Destination Stats Return_Code Last_Run); # Put results in hash, use column names for keys, set values to undef.
my $ipsla = $self->{ssh_obj}->exec('show ip sla');
foreach my $line (@$ipsla) {
chomp $line; # Remove newlines from last field
if($line =~ /^([*^~])([0-9]+)\s+([a-z]+)\s+([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)\s+([[:alnum:]=]+)\s+([A-Z]+)\s+([^\s].*)$/) {
$results{Code} = $1; # Code prefixing ID
$results{ID} = $2;
$results{Type} = $3;
$results{Destination} = $4;
$results{Stats} = $5;
$results{Return_Code} = $6;
$results{Last_Run} = $7;
}
}
# Testing
use Data::Dumper;
print Dumper(\%results);
}
# Demonstrate
check_summary();
# Commented for testing
#INFO ($type,$return_type);
Worked on the submitted test line.
EDIT:
Regular expressions allow you to specify patterns instead of the exact text you are attempting to match. This is powerful but complicated at times. You need to read the Perl Regular Expression documentation to really learn them.
Perl regular expressions also allow you to capture the matched text. This can be done multiple times in a single pattern which is how we were able to capture all the columns with one expression. The matches go into numbered variables...
$1
$2

Remove duplicate lines on file by substring - preserve order (PERL)

I'm trying to write a perl script to deal with some 3+ GB text files, which are structured like:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations :
Count the number of delimiters per line and compare it to a static number (e.g. 5); lines that do not match that number should be output to a file.control.
Remove duplicates from the file by substring($line, 0, 7), i.e. the first 7 digits, while preserving order. I want the output of that in a file.output.
I have coded this as a simple shell script (just bash), but it took too long to process. The same script calling perl one-liners was quicker, but I'm interested in a way to do this purely in perl.
The code I have so far is:
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
And i want the file.output file to look like
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like :
(assuming delimiter control number is 6)
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Posting edits : Tried code
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I don't know if I'm supposed to post new code here, but I tried the above, based on your example. What baffles me (I'm still very new to perl) is that it doesn't output to either filehandle, but when I redirected from the command line just as you said, it worked perfectly. The problem is that I need to output to 2 different files.

It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar.
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_.
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
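For the two-file requirement in the question's edit, note that print fh1 $_ writes to a bareword filehandle named fh1, not to the lexical $fh1 that was opened; with lexical handles the call is print $fh1 $_ or print {$fh1} $_. The sketch below separates the filtering from the I/O so it can be checked on the sample data. It is only an illustration: count_delims and filter_lines are hypothetical names, the delimiter is counted with a global match instead of eval, and the expected count of 6 follows the question's "assuming delimiter control number is 6".

```perl
use strict;
use warnings;

# Count occurrences of a delimiter held in a variable, without eval.
sub count_delims {
    my ($line, $delimiter) = @_;
    my $count = () = $line =~ /\Q$delimiter\E/g;   # list assignment in scalar context
    return $count;
}

# Return two array refs: lines whose delimiter count differs from $expected,
# and lines that are the first to use their 7-character prefix.
sub filter_lines {
    my ($lines, $delimiter, $expected) = @_;
    my (%seen, @control, @output);
    for my $line (@$lines) {
        push @control, $line if count_delims($line, $delimiter) != $expected;
        push @output,  $line unless $seen{ substr($line, 0, 7) }++;
    }
    return (\@control, \@output);
}

my @sample = map { "$_\n" } qw(
    1212123x534534534534xx4545454x232322xx
    0901001x876879878787xx0909918x212245xx
    1212123x534534534534xx4545454x232323xx
    1212133x534534534534xx4549454x232322xx
    4352342xx23232xxx345545x45454x23232xxx
);

my ($control, $output) = filter_lines(\@sample, 'x', 6);
print "control:\n", @$control;
print "output:\n",  @$output;
```

For a 3 GB input the lines would be processed one at a time inside the read loop rather than collected in arrays, with each push replaced by a print to the corresponding lexical filehandle.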

Perl program for extracting the functions alone in a Ruby file

I am having the following Ruby program.
puts "hai"
def mult(a,b)
a * b
end
puts "hello"
def getCostAndMpg
cost = 30000 # some fancy db calls go here
mpg = 30
return cost,mpg
end
AltimaCost, AltimaMpg = getCostAndMpg
puts "AltimaCost = #{AltimaCost}, AltimaMpg = {AltimaMpg}"
I have written a perl script which will extract the functions alone in a Ruby file as follows
while (<DATA>){
print if ( /def/ .. /end/ );
}
Here the <DATA> is reading from the ruby file.
So the perl program produces the following output.
def mult(a,b)
a * b
end
def getCostAndMpg
cost = 30000 # some fancy db calls go here
mpg = 30
return cost,mpg
end
But if the function contains a nested block of statements, for example an if block, then it does not work. It only captures up to the "end" of the "if" block, not up to the "end" of the function. Kindly suggest a solution.
Example:
def function
if x > 2
puts "x is greater than 2"
elsif x <= 2 and x!=0
puts "x is 1"
else
puts "I can't guess the number"
end #----- My code parsing only up to this
end
Thanks in Advance!

If your code is properly indented, you just want lines that start with def or end, so change your program to:
while (<DATA>){
print if ( /^def/ .. /^end/ );
}
Or run it without a program file at all - run the program from the command line, using -n to have perl treat it as a while loop reading from STDIN:
perl -n -e "print if ( /^def/ .. /^end/ );" < ruby-file.rb

I am not familiar with ruby syntax but if you can ensure good indentation all over the code, you can check based on indentation. Something similar to:
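my $add = 0;
my $spaces;
while(my $str = <DATA>) {
if (! $add && $str =~ /^(\s*)def function/) {
# start of the function: remember its indentation and print the def line
$add = 1;
$spaces = $1;
print $str;
next;
}
if ($add) {
print $str;
# a later line at the same indentation as def (the closing end) stops the output
$add = 0 if ($str =~ /^\Q$spaces\E\S/);
}
}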

Another option could be counting level of program, something like this:
my $level = 0;
while(<DATA>) {
if(/\b def \b/x .. /\b end \b/x && $level == 0) {
$level++ if /\b if \b/x; # put all statements that are closed by end here
$level-- if /\b end \b/x && $level > 0; # the function's own end leaves $level at 0
print;
}
}
I am not all that familiar with ruby syntax, so you need to add every statement that is closed by end to the regex on the $level++ line.
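The same level-counting idea can also be written with an explicit depth counter instead of the range operator. The following is a rough sketch, not a Ruby parser: extract_defs is a hypothetical helper, the list of block-opening keywords is an assumption and is probably incomplete, and it will be fooled by one-line definitions, statement modifiers placed at the start of a line, and keywords inside strings or comments.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that belong to top-level
# "def ... end" blocks, counting nested blocks so that an inner "end"
# does not close the function.
sub extract_defs {
    my @lines = @_;
    my ($depth, @kept) = (0);
    for my $line (@lines) {
        if ($depth == 0) {
            next unless $line =~ /^\s*def\b/;    # wait for a function to start
            $depth = 1;
        }
        elsif ($line =~ /^\s*(?:def|if|unless|while|until|case|begin|for)\b/
            || $line =~ /\bdo\s*(?:\|[^|]*\|)?\s*$/) {
            $depth++;                            # nested block needing its own end
        }
        elsif ($line =~ /^\s*end\b/) {
            $depth--;                            # closes the innermost open block
        }
        push @kept, $line;
    }
    return @kept;
}

my @ruby = (
    "puts \"hai\"\n",
    "def function\n",
    "  if x > 2\n",
    "    puts \"x is greater than 2\"\n",
    "  else\n",
    "    puts \"I can't guess the number\"\n",
    "  end\n",
    "end\n",
    "puts \"hello\"\n",
);
print extract_defs(@ruby);
```

Because elsif and else do not open a new block that needs its own end, they are deliberately left out of the keyword list.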
Please note I added \b around those keywords to make sure you are matching whole word and not things like undef as start of function.