I'm writing a script to parse Java thread dumps. For some reason, when I try to read from within the subroutine, or inside a nested loop, it doesn't enter the nested loop at all. Ideally I want to be able to operate on STDIN in nested loops; otherwise I'll have to write some ugly state-transition code.
Before, I was reading from STDIN directly, but just to make sure that my subroutine didn't have an independent pointer to STDIN, I opened it into $in.
When I run it, the output looks like the below. You can see that it never enters the nested loop, despite there being more lines from STDIN for the outer loop to read.
~/$ cat catalina.out-20160* | thread.dump.find.all.pl
in is GLOB(0x7f8d440054e8)
found start of thread dump at 2016-06-17 13:38:23 saving to tdump.2016.06.17.13.38.23.txt
in is GLOB(0x7f8d440054e8)
BEFORE NESTED STDIN
BUG!!!!
found start of thread dump at 2016-06-17 13:43:05 saving to tdump.2016.06.17.13.43.05.txt
in is GLOB(0x7f8d440054e8)
BEFORE NESTED STDIN
BUG!!!!
...
The code:
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use DateTime::Format::Strptime;
use DateTime::Format::Duration;
use Data::Dumper;
# DO NOT touch ARGV!
Getopt::Long::Configure("pass_through");
# cat catalina.out-* | thread.dump.find.all.pl
sub processThreadDump {
    my $in=$_[0];
    my $currentLine=$_[1];
    my $prevLine=$_[2];
    my $parsedDatetime=$_[2];
    # 2016-09-28 09:27:34
    $parsedDatetime=~ s/[ \-\:]/./g;
    my $outfile="tdump.$parsedDatetime.txt";
    print " saving to $outfile\n";
    print " in is $in\n";
    open(my $out, '>', $outfile);
    print $out "$prevLine\n";
    print $out "$currentLine\n";
    print "BEFORE NESTED STDIN\n";
    foreach my $line ( <$in> ) {
        print "INSIDE NESTED STDIN\n";
        $line =~ s/\R//g; #remove newlines
        print $out "$line\n";
        if( $line =~ m/JNI global references:/ ) {
            print "PROPERLY LEFT NESTED STDIN\n";
            close($out);
            return;
        } elsif( $line =~ m/Found \d+ deadlock\./ ) {
            print "PROPERLY LEFT NESTED STDIN\n";
            close($out);
            return;
        }
    }
    print "BUG!!!!\n";
    close($out);
}
open(my $in, '<-');
print "in is $in\n";
my $prevLine;
# read from standard in
foreach my $line ( <$in> ) {
    $line =~ s/\R//g; #remove newlines
    if( $line =~ m/Full thread dump OpenJDK 64-Bit Server VM/ ) {
        # we found the start of a thread dump
        print "found start of thread dump at ${prevLine}";
        processThreadDump($in, $line, $prevLine);
    } else {
        #print "setting prev line to $line\n";
        $prevLine=$line;
    }
}
close($in);
The foreach iterates over a list, so <> is in list context and thus reads everything from the filehandle at once. By the time you pass $in to the sub there is no input left on it. See "I/O Operators" in perlop.
You can read a line at a time with while (my $line = <$in>), but I am not sure whether that may affect the rest of your algorithm.
Alternatively, if you do read all input ahead of time, why not just work with an array of lines?
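For illustration, a minimal sketch of that array-of-lines approach (simplified markers, not the OP's full algorithm): with everything slurped up front, the "nested" scan is just the same index moving forward.
#!/usr/bin/perl
use strict;
use warnings;

my @lines = <STDIN>;    # slurp all of standard input up front
chomp @lines;

for (my $i = 0; $i < @lines; $i++) {
    if ($lines[$i] =~ /Full thread dump/) {
        # the "nested" scan simply advances the same index
        while ($i < @lines && $lines[$i] !~ /JNI global references:/) {
            # ... write $lines[$i] to the current output file here ...
            $i++;
        }
    }
}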
When you say foreach my $line ( <$in> ), this causes perl to read the entire $in filehandle before starting the loop. What you probably want is more like this:
while (defined(my $line = <$in>))
This will only read one line at a time, discarding it as you finish with it.
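Applied to the code in the question, that means using while in both the outer loop and the loop inside the sub, so each call only consumes lines up to the end-of-dump marker. A cut-down sketch (hypothetical drain_dump helper and simplified markers; untested against real dumps):
#!/usr/bin/perl
use strict;
use warnings;

# The sub reads from the SAME handle and stops at the end-of-dump marker,
# so the outer loop continues with the very next line afterwards.
sub drain_dump {
    my ($in) = @_;
    while (defined(my $line = <$in>)) {
        print "dump line: $line";
        return if $line =~ /JNI global references:/;
    }
    return;
}

open(my $in, '<-') or die "cannot open STDIN: $!";
while (defined(my $line = <$in>)) {
    drain_dump($in) if $line =~ /Full thread dump/;
}
close($in);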
I am trying to convert a string into an array based on a space delimiter.
My input file looks like this:
>Reference
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnctcACCATGGTGTCGACTC
TTCTATGGAAACAGCGTGGATGGCGTCTCCAGGCGATCTGACGGTTCACTAAACGAGCTC
Ignoring the line starting with >, the length of the rest of the string is 360.
I am trying to convert this into an array.
Here's my code so far:
#!/usr/bin/perl
use strict;
use warnings;
#### To to change bases with less than 10X coverage to N #####
#### Take depth file and consensus fasta file as input arguments ####
my ($in2) = @ARGV;
my $args = $#ARGV + 1;
if ( $args != 1 ) {
    print "Error!!! Insufficient Number of Arguments\n";
    print "Usage: $0 <consensus fasta file> \n";
}
#### Open a filehandle to read in consensus fasta file ####
my $FH2;
my $line;
my @consensus;
my $char;
open($FH2, '<', $in2) || die "Could not open file $in2\n";
while ( <$FH2> ) {
    $line = $_;
    chomp $line;
    next if $line =~ />/; # skip header line
    $line =~ s/\s+//g;
    my $len = length($line);
    print "$len\n";
    #print "$line";
    @consensus = split(// , $line);
    print "$#consensus\n";
    #print "@consensus\n";
    #for $char (0 .. $#consensus){
    #    print "$char: $consensus[$char]\n";
    #}
}
The problem is that $len returns a value of 60 instead of 360, and $#consensus returns a value of 59 instead of 360, which is the length of the string.
I have removed the whitespace after each line with $line =~ s/\s+//g; but it still is not working.
It looks like your code is essentially working. It's just your checking logic that makes no sense. I'd do the following:
use strict;
use warnings;
if (@ARGV != 1) {
    print STDERR "Usage: $0 <consensus fasta file>\n";
    exit 1;
}
open my $fh, '<', $ARGV[0] or die "$0: cannot open $ARGV[0]: $!\n";
my @consensus;
while (my $line = readline $fh) {
    next if $line =~ /^>/;
    $line =~ s/\s+//g;
    push @consensus, split //, $line;
}
print "N = ", scalar @consensus, "\n";
Main things to note:
Error messages should go to STDERR, not STDOUT.
If an error occurs, the program should exit with an error code, not keep running.
Error messages should include the name of the program and the reason for the error.
chomp is redundant if you're going to remove all whitespace anyway.
As you're processing the input line by line, you can just keep pushing elements to the end of @consensus. At the end of the loop it'll have accumulated all characters across all lines.
Examining @consensus within the loop makes little sense as it hasn't finished building yet. Only after the loop do we have all characters we're interested in.
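As a side note on the numbers in the question, a tiny illustration (made-up data) of the difference between $#array, the index of the last element, and scalar @array, the element count; this is likely where the 59 versus 60 came from, since each call to split only ever saw one 60-character line:
use strict;
use warnings;

my @chars = split //, 'n' x 60;   # one 60-character line, split into characters
print scalar @chars, "\n";        # prints 60 - the number of elements
print $#chars, "\n";              # prints 59 - the index of the last element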
Here is a part of my script:
foreach $i ( @contact_list ) {
    print "$i\n";
    $e = "zcat $file_list2| grep $i";
    print "$e\n";
    $f = qx($e);
    print "$f";
}
$e prints properly but $f gives a blank line even when $file_list2 has a match for $i.
Can anyone tell me why?
It is always better to use Perl's grep instead of using a pipe:
@lines = `zcat $file_list2`;   # move output of zcat to array
die('zcat error') if ($?);     # will exit the script with an error if zcat has a problem
# chomp(@lines);               # this will remove "\n" from each line
foreach $i ( @contact_list ) {
    print "$i\n";
    @ar = grep (/$i/, @lines);
    print @ar;
    # print join("\n",@ar)."\n"; # in case of using chomp
}
The best solution is not calling zcat at all, but using the zlib library:
http://perldoc.perl.org/IO/Zlib.html
use IO::Zlib;
# ....
# place your definition of $file_list2 and @contact_list here.
# ...
$fh = new IO::Zlib;
$fh->open($file_list2, "rb")
    or die("Cannot open $file_list2");
@lines = <$fh>;
$fh->close;
#chomp(@lines); # remove "\n" symbols from lines
foreach $i ( @contact_list ) {
    print "$i\n";
    @ar = grep (/$i/, @lines);
    print (@ar);
    # print join("\n",@ar)."\n"; # in case of using chomp
}
Your question leaves us guessing about many things, but a better overall approach would seem to be opening the file just once, and processing each line in Perl itself.
open(F, "zcat $file_list |") or die "$0: could not zcat: $!\n";
LINE:
while (<F>) {
######## FIXME: this could be optimized a great deal still
foreach my $i (#contact_list) {
if (m/$i/) {
print $_;
next LINE;
}
}
}
close (F);
If you want to squeeze out more from the inner loop, compile the regexes from @contact_list into a separate array before the loop, or perhaps combine them into a single regex if all you care about is whether one of them matched. If, on the other hand, you want to print all matches for one pattern only at the end when you know what they are, collect matches into one array per search expression, then loop them and print when you have grepped the whole set of input files.
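For example, a sketch of precompiling the patterns, or joining them into one alternation, before the line loop (placeholder data; assumes the entries in @contact_list are meant as literal strings):
use strict;
use warnings;

my @contact_list = ('alice@example.com', 'bob@example.com');   # placeholder entries

# Precompile each pattern once...
my @patterns = map { qr/\Q$_\E/ } @contact_list;

# ...or combine them into a single alternation if all you need is "did any of them match".
my $alternation = join '|', map { quotemeta } @contact_list;
my $any_re      = qr/$alternation/;

while (my $line = <STDIN>) {
    print $line if $line =~ $any_re;
}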
Your problem is not reproducible without information about what's in $i, but I can guess that it contains some shell metacharacter which causes it to be processed by the shell before the grep runs.
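One way to rule that out is to bypass the shell entirely with the list form of a piped open and do a literal substring match on the Perl side; a sketch with placeholder values:
use strict;
use warnings;

my $file_list2 = 'contacts.gz';            # placeholder file name
my $i          = 'john.doe@example.com';   # may contain shell or regex metacharacters

# The list form of open runs zcat directly, so no shell ever sees $file_list2 or $i.
open(my $zcat, '-|', 'zcat', $file_list2) or die "cannot run zcat: $!";
while (my $line = <$zcat>) {
    print $line if index($line, $i) >= 0;   # literal substring match, no metacharacters involved
}
close($zcat);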
When I run this code, I am purely trying to get all the lines containing the word "that" in them. Sounds easy enough. But when I run it, I get a list of matches that contain the word "that" but only at the end of the line. I don't know why it's coming out like this and I have been going crazy trying to solve it. I am currently getting an output of 268 total matches, and the output I need is only 13. Please advise!
#!/usr/bin/perl -w
#Usage: conc.shift.pl textfile word
open (FH, "$ARGV[0]") || die "cannot open";
@array = (1,2,3,4,5);
$count = 0;
while($line = <FH>) {
    chomp $line;
    shift @array;
    push(@array, $line);
    $count++;
    if ($line =~ /that/)
    {
        $output = join(" ",@array);
        print "$output \n";
    }
}
print "Total matches: $count\n";
Don't you want to increment your $count variable only if the line contains "that", i.e.:
if ($line =~ /that/) {
$count++;
instead of incrementing the counter before checking if $line contains "that", as you have it:
$count++;
if ($line =~ /that/) {
Similarly, I suspect that your push() and join() calls, for stashing a matching line in @array, should also be within the if block, only executed if the line contains "that".
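For reference, a cut-down sketch of just the counting change (it drops the @array context window from the original; untested against your text file):
#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!";

my $count = 0;
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /that/) {   # count (and print) only lines that actually match
        $count++;
        print "$line\n";
    }
}
print "Total matches: $count\n";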
Hope this helps!
I have a sample file here http://pastebin.com/m5m40nGF
What I want to do is add a line after every instance of protein_id.
protein_id always has the same pattern:
TAB-TAB-TAB-protein_id-TAB-gnl|CorradiLab|M715_#SOME_NUMBER
What I need to do is to add this after every line of protein_id:
TAB-TAB-TAB-transcript_id-TAB-gnl|CorradiLab|M715_mRNA_#SOME_NUMBER
The catch is that #SOME_NUMBER has to stay the same.
In the first case, it would look like this:
94 1476 CDS
protein_id gnl|CorradiLab|M715_ECU01_0190
transcript_id gnl|CorradiLab|M715_mRNA_ECU01_0190
product serine hydroxymethyltransferase
label serine hydroxymethyltransferase
Thanks! Adrian
I tried a Perl solution, but I get an error.
open(IN, $in); while(<IN>){
print $_;
if ($_ ~= /gnl\|CorradiLab\|/) {
$_ =~ s/tprotein_id/transcript_id/;
print $_;
}
}
Error:
syntax error at test.pl line 3, near "$_ ~"
syntax error at test.pl line 7, near "}"
Execution of test.pl aborted due to compilation errors.
The following Perl script worked:
my $in=shift;
open(IN, $in); while(<IN>){
print $_;
if ($_ =~ /gnl\|CorradiLab\|/) {
my $tmp = $_;
$tmp =~ s/protein_id/transcript_id/;
print $tmp;
}
}
Offering an update on the existing answer because I feel it can be improved further.
Generally, the precise problem in the OP is this line:
if ($_ ~= /gnl\|CorradiLab\|/) {
Because you've got ~= not =~. That's what syntax error at test.pl line 3, near "$_ ~" is trying to tell you.
I would offer that improving on:
my $in=shift;
open(IN, $in); while(<IN>){
print $_;
if ($_ =~ /gnl\|CorradiLab\|/) {
my $tmp = $_;
$tmp =~ s/protein_id/transcript_id/;
print $tmp;
}
}
Using while ( my $tmp = <IN> ) { skips the need to assign from $_.
A three-argument open with a lexical filehandle is preferable, e.g. open( my $in, '<', $input_filename ) or die $!; (you should also test whether the open worked, as the or die does here).
An explicit open may well be unnecessary if you're just reading a filename from the command line. Using <> either reads the files named on the command line (opening and processing each in turn) or STDIN, which makes your script a bit more versatile.
Thus I would rewrite as:
#!/usr/bin/perl
use strict;
use warnings;
while ( my $line = <> ) {
    print $line;
    if ( $line =~ /gnl\|CorradiLab\|/ ) {
        $line =~ s/protein_id/transcript_id/;
        print $line;
    }
}
Or alternatively:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
    print;
    if (m/gnl\|CorradiLab\|/) {
        s/protein_id/transcript_id/;
        print;
    }
}
I need some Perl help putting these two processes/pieces of code to work together. I was able to get them working individually to test, but I need help bringing them together, especially with the loop constructs. I'm not sure if I should go with foreach... anyway, the code is below.
Also, any best-practice tips would be great too, as I'm learning this language. Thanks for your help.
Here's the process flow I am looking for:
read a directory
look for a particular file
use the file name to strip out some key information to create a newly processed file
process the input file
create the newly processed file for each input file read (if I read in 10, I create 10 new files)
Part 1:
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
    next if ($file =~ /^\.+$/);
    #Get filename attributes
    if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
        print "$1\n";
        print "$2\n";
        print "$3\n";
    }
    print "$file\n";
}
Part 2:
use strict;
use Digest::MD5 qw(md5_hex);
#Create new file
open (NEWFILE, ">/backups/processed/foo$1.name.$2-foo_p$3.out") || die "cannot create file";
my $data = '';
my $line1 = <>;
chomp $line1;
my @heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ( "^A", "^E", "^D");
while (<>)
{
    my $digest = md5_hex($data);
    chomp;
    my (@values) = split /,/;
    my $extra = "__mykey__$sep1$digest$sep2" ;
    $extra .= "$heading[$_]$sep1$values[$_]$sep2" for (0..scalar(@values));
    $data .= "$extra$eorec";
    print NEWFILE "$data";
}
#print $data;
close (NEWFILE);
You are using an old style of Perl programming. I recommend using functions and CPAN modules (http://search.cpan.org). Perl pseudocode:
use Modern::Perl;
# use...
sub get_input_files {
    # return an array of files (@)
}
sub extract_file_info {
    # takes the file name and returns an array of values (filename attrs)
}
sub process_file {
    # reads the input file, takes the previous attribs and builds the output file
}
my @ifiles = get_input_files;
foreach my $ifile (@ifiles) {
    my @attrs = extract_file_info($ifile);
    process_file($ifile, @attrs);
}
Hope it helps
I've bashed your two code fragments together (making the second a sub that the first calls for each matching file) and, if I understood your description of the objective correctly, this should do what you want. Comments on style and syntax are inline:
#!/usr/bin/env perl
# - Never forget these!
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
    # Parens on postfix "if" are optional; I prefer to omit them
    next if $file =~ /^\.+$/;
    if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
        process_file($file, $1, $2, $3);
    }
    print "$file\n";
}

sub process_file {
    my ($orig_name, $foo_x, $name_x, $p_x) = @_;
    my $new_name = "/backups/processed/foo$foo_x.name.$name_x-foo_p$p_x.out";

    # - From your description of the task, it sounds like we actually want to
    #   read from the found file, not from <>, so opening it here to read
    # - Better to use lexical ("my") filehandle and three-arg form of open
    # - "or" has lower operator precedence than "||", so less chance of
    #   things being grouped in the wrong order (though either works here)
    # - Including $! in the error will tell why the file open failed
    open my $in_fh, '<', $orig_name or die "cannot read $orig_name: $!";
    open(my $out_fh, '>', $new_name) or die "cannot create $new_name: $!";

    my $data = '';
    my $line1 = <$in_fh>;
    chomp $line1;
    my @heading = split /,/, $line1;
    my ($sep1, $sep2, $eorec) = ("^A", "^E", "^D");
    while (<$in_fh>) {
        chomp;
        my $digest = md5_hex($data);
        my (@values) = split /,/;
        my $extra = "__mykey__$sep1$digest$sep2";
        $extra .= "$heading[$_]$sep1$values[$_]$sep2"
            for (0 .. scalar(@values));
        # - Useless use of double quotes removed on next two lines
        $data .= $extra . $eorec;
        #print $out_fh $data;
    }

    # - Moved print to output file to here (where it will print the complete
    #   output all at once) rather than within the loop (where it will print
    #   all previous lines each time a new line is read in) to prevent
    #   duplicate output records. This could also be achieved by printing
    #   $extra inside the loop. Printing $data at the end will be slightly
    #   faster, but requires more memory; printing $extra within the loop and
    #   getting rid of $data entirely would require less memory, so that may
    #   be the better option if you find yourself needing to read huge input
    #   files.
    print $out_fh $data;

    # - $in_fh and $out_fh will be closed automatically when it goes out of
    #   scope at the end of the block/sub, so there's no real point to
    #   explicitly closing it unless you're going to check whether the close
    #   succeeded or failed (which can happen in odd cases usually involving
    #   full or failing disks when writing; I'm not aware of any way that
    #   closing a file open for reading can fail, so that's just being left
    #   implicit)
    close $out_fh or die "Failed to close file: $!";
}
Disclaimer: perl -c reports that this code is syntactically valid, but it is otherwise untested.