Extracting multiple lines of record/data using a subroutine or functions - perl

Can you show me how to create a subroutine or function using this code?
Basically I want to make my code into a subroutine so I'll be able to re-use it without making my script too long.
Here is my script:
#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Carp qw(croak);
my @fields;
my ($tmp_var, $rec_type, $country, $header, $Combline, $records, $line);
my $filename = 'data5.txt';
open (my $input_fh, '<', $filename ) or croak "Can't open $filename: $!";
open my $OUTPUTA, ">", 'drp1.txt' or die $!;
open my $OUTPUTB, ">", 'drp2.txt' or die $!;
while (<$input_fh>) {
$line = _trim($_);
@fields = split (/\|/, $line);
$rec_type = $fields[0];
$country = $fields[1];
my $string = substr $fields[1], 0, 1;
$header = $line if(/^INVHDR/);
if ($rec_type eq 'INVDET') {
if ($string eq 'I') {
$records = $header . $line;
print $OUTPUTA $records, scalar <$input_fh>;
}
else {
$records = $header . $line;
print $OUTPUTB $records, scalar <$input_fh>;
}
}
}
close $OUTPUTA or die $!;
close $OUTPUTB or die $!;
sub _trim {
my $word = shift;
if ( $word ) {
$word =~ s/\s*\|/\|/g; #remove trailing spaces
$word =~ s/"//g; #remove double quotes
}
return $word;
}
This is the part of the script that I wanted to put in a subroutine or function:
$line = _trim($_);
@fields = split (/\|/, $line);
$rec_type = $fields[0];
$country = $fields[1];
my $string = substr $fields[1], 0, 1;
$header = $line if (/^INVHDR/);
if ($rec_type eq 'INVDET') {
if ($string eq 'I') {
$records = $header . $line;
print $OUTPUTA $records, scalar <$input_fh>;
}
else {
$records = $header . $line;
print $OUTPUTB $records, scalar <$input_fh>;
}
}

I would suggest breaking it out a little differently and expanding your _trim function into a parse function:
use strict;
use warnings;
my $filename = 'data5.txt';
open( my $input_fh, '<', $filename ) or die "Can't open $filename: $!";
open( my $OUTPUTA, '>', 'drp1.txt' ) or die $!;
open( my $OUTPUTB, '>', 'drp2.txt' ) or die $!;
my $header = '';
while (<$input_fh>) {
if ($_ =~ /^INVHDR/) {
$header = $_;
}
if ($_ =~ /^INVDET/) {
my @data = parse($_);
my $line = $header . join('|', @data);
# scalar <$input_fh> is almost certainly not doing what you expect,
# though I'm not sure what you're trying to accomplish with it
if ( $data[1] =~ /^I/ ) {
print $OUTPUTA $line;
} else {
print $OUTPUTB $line;
}
}
}
sub parse {
my $input = shift || return;
$input =~ s/"//g; # remove double quotes
# Here I've combined the removal of trailing spaces with the split.
my @fields = split( m{\s*\|}, $input );
return @fields;
}
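To illustrate the comment about scalar <$input_fh>: the sketch below uses made-up in-memory data (not data5.txt) to show that reading from the handle inside the loop body consumes the next line, so the loop condition never sees it. The variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical in-memory data standing in for the real input file.
my $data = "line 1\nline 2\nline 3\nline 4\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my (@looped, @consumed);
while (my $line = <$fh>) {
    chomp $line;
    push @looped, $line;

    # In scalar context, <$fh> reads exactly one more line, so the
    # loop condition never gets to see that line.
    my $next = scalar <$fh>;
    if (defined $next) {
        chomp $next;
        push @consumed, $next;
    }
}

print "seen by the loop: @looped\n";     # line 1 line 3
print "read inside body: @consumed\n";   # line 2 line 4
```

If the line after each INVDET record really belongs to that record, this behaviour may be what you want; otherwise every second record is silently skipped.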

Related

how to display the hash value from my sample data

I'm learning Perl at the moment and would like help with this exercise.
My objective is to display the hash value of PartID 1,2,3.
The sample output displays the lot, wafer, program, version, testnames, testnumbers, hilimit, lolimit and partid values only.
Sample data:
lot=lot123
wafer=1
program=prgtest
version=1
Testnames,T1,T2,T3
Testnumbers,1,2,3
Hilimit,5,6,7
Lolimit,1,2,3
PartID,,,,
1,3,0,5
2,4,3,2
3,5,6,3
This is my code:
#!/usr/bin/perl
use strict;
use Getopt::Long;
my $file = "";
GetOptions ("infile=s" => \$file ) or die("Error in command line arguments\n");
my $lotid = "";
open(DATA, $file) or die "Couldn't open file $file";
while(my $line = <DATA>) {
#print "$line";
if ( $line =~ /^lot=/ ) {
#print "$line \n";
my ($dump, $lotid) = split /=/, $line;
print "$lotid\n";
}
elsif ($line =~ /^program=/ ) {
my ($dump, $progid) = split /=/, $line;
print "$progid \n";
}
elsif ($line =~ /^wafer=/ ) {
my ($dump, $waferid) = split /=/, $line;
print "$waferid \n";
}
elsif ($line =~ /^version=/ ) {
my ($dump, $verid) = split /=/, $line;
print "$verid \n";
}
elsif ($line =~ /^testnames/i) {
my ($dump, @arr) = split /\,/, $line;
foreach my $e (@arr) {
print $e, "\n";
}
}
elsif ($line =~ /^testnumbers/i) {
my ($dump, @arr1) = split /\,/, $line;
foreach my $e1 (@arr1) {
print $e1, "\n";
}
}
elsif ($line =~ /^hilimit/i) {
my ($dump, @arr2) = split /\,/, $line;
foreach my $e2 (@arr2) {
print $e2, "\n";
}
}
elsif ($line =~ /^lolimit/i) {
my ($dump, @arr3) = split /\,/, $line;
foreach my $e3 (@arr3) {
print $e3, "\n";
}
}
}
Kindly help add to my code to display Partid 1,2,3 hash.
So I've rewritten your code a little to use a few more modern Perl idioms (along with some comments to explain what I've done). The bit I've added is near the bottom.
#!/usr/bin/perl
use strict;
# Added 'warnings' which you should always use
use warnings;
# Use say() instead of print()
use feature 'say';
use Getopt::Long;
my $file = "";
GetOptions ("infile=s" => \$file)
or die ("Error in command line arguments\n");
# Use a lexical variable for a filehandle.
# Use the (safer) 3-argument version of open().
# Add $! to the error message.
open(my $fh, '<', $file) or die "Couldn't open file $file: $!";
# Read each record into $_ - which makes the following code simpler
while (<$fh>) {
# Remove the trailing newline so that say() doesn't print a blank line.
chomp;
# Match on $_
if ( /^lot=/ ) {
# Use "undef" instead of a $dump variable.
# split() works on $_ by default.
my (undef, $lotid) = split /=/;
# Use say() instead of print() - less punctuation :-)
say $lotid;
}
elsif ( /^program=/ ) {
my (undef, $progid) = split /=/;
say $progid;
}
elsif ( /^wafer=/ ) {
my (undef, $waferid) = split /=/;
say $waferid;
}
elsif ( /^version=/ ) {
my (undef, $verid) = split /=/;
say $verid;
}
elsif ( /^testnames/i) {
my (undef, @arr) = split /\,/;
# Changed all of these similar pieces of code
# to use the same variable names. As they are
# defined in different code blocks, they are
# completely separate variables.
foreach my $e (@arr) {
say $e;
}
}
elsif ( /^testnumbers/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
elsif ( /^hilimit/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
elsif ( /^lolimit/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
# And here's the new bit.
# If we're on the "partid" line, then read the next
# three lines, split each one and print the first
# element from the list returned by split().
elsif ( /^partid/i) {
say +(split /,/, <$fh>)[0] for 1 .. 3;
}
}
Update: By the way, there are no hashes anywhere in this code :-)
Update 2: I've just realised that you only have three different ways to process the data. So you can simplify your code drastically by using slightly more complex regexes.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Getopt::Long;
my $file = "";
GetOptions ("infile=s" => \$file)
or die ("Error in command line arguments\n");
open(my $fh, '<', $file) or die "Couldn't open file $file: $!";
while (<$fh>) {
# Remove the trailing newline so that say() doesn't print a blank line.
chomp;
# Single value - just print it.
if ( /^(?:lot|program|wafer|version)=/ ) {
my (undef, $value) = split /=/;
say $value;
}
# List of values - split and print.
elsif ( /^(?:testnames|testnumbers|hilimit|lolimit)/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
# Extract values from following lines.
elsif ( /^partid/i) {
say +(split /,/, <$fh>)[0] for 1 .. 3;
}
}
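Since the question asks for a hash and the code so far prints values without storing them, here is a minimal sketch of one way to build a hash keyed by PartID. The parse_parts helper and the nested layout (PartID, then test name, then value) are assumptions made for illustration, not something the question specifies. The sample data is read through an in-memory filehandle so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original code: returns a hash
# reference of the form { PartID => { test name => value, ... }, ... }.
sub parse_parts {
    my ($fh) = @_;
    my (@names, %parts, $in_parts);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^testnames,/i) {
            (undef, @names) = split /,/, $line;
        }
        elsif ($line =~ /^partid/i) {
            $in_parts = 1;    # every later non-blank line is a part record
        }
        elsif ($in_parts and $line =~ /\S/) {
            my ($id, @values) = split /,/, $line;
            @{ $parts{$id} }{@names} = @values;    # hash slice assignment
        }
    }
    return \%parts;
}

# The sample data from the question.
my $sample = <<'END';
lot=lot123
wafer=1
program=prgtest
version=1
Testnames,T1,T2,T3
Testnumbers,1,2,3
Hilimit,5,6,7
Lolimit,1,2,3
PartID,,,,
1,3,0,5
2,4,3,2
3,5,6,3
END

open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my $parts = parse_parts($fh);

for my $id (sort { $a <=> $b } keys %$parts) {
    my $row = $parts->{$id};
    print "PartID $id: ",
        join(', ', map { "$_=$row->{$_}" } sort keys %$row), "\n";
}
```

With the sample data, the first line printed is "PartID 1: T1=3, T2=0, T3=5". To read a real file instead, pass the handle opened on $file to parse_parts.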

Correct use of Perl "exists"

I have two files. The first two columns in both are chromosome loci and genotypes, for instance chr1:1736464585 and T/G.
I have put the first two columns into a hash. I want to check whether the hash key (the chromosome locus) exists in the second file.
I have written this Perl program and have tried many variations but I'm not sure if I'm using exists correctly: it gives the error exists is not an HASH or ARRAY element or a subroutine.
#!/usr/bin/perl
use strict;
use warnings;
my $output = "annotated.txt";
open( O, ">>$output" );
my $filename = "datatest.txt";
my $filename2 = "MP2.txt";
chomp $filename;
chomp $filename2;
my %hash1 = ();
open( FN1, $filename ) or die "Can't open $filename: $!";
my @lines = <FN1>;
foreach my $line (@lines) {
my @split = split /\t/, $line;
if ( $line =~ /^chr/ ) {
my ( $key, $value ) = ( $split[0], $split[1] );
$hash1{$key} = $value;
}
}
my $DATA;
open( $DATA, $filename2 ) or die $!;
my @lines2 = <$DATA>;
foreach my $line2 (@lines2) {
my @split2 = split /\t/, $line2;
if ( $line2 =~ /^chr/ ) {
if ( exists %hash1{$key} ) {
print "$line2\n";
}
}
}
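The syntax of the following line is incorrect:
if (exists %hash1{$key}) { ... }
This should be:
if (exists $hash1{$key}) { ... }
Note also that $key is declared with my inside the first loop, so it is not visible in the second loop. Under use strict you would need to test the locus from the current line instead, for example exists $hash1{ $split2[0] }.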
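A minimal sketch of the corrected lookup, assuming the key to test is the first tab-separated column of each line in the second file. The @lines2 contents below are made-up stand-ins for MP2.txt, and @found is a name introduced only for this example.

```perl
use strict;
use warnings;

# Example locus and genotype taken from the question text.
my %hash1 = ( 'chr1:1736464585' => 'T/G' );

# Hypothetical lines standing in for the contents of the second file.
my @lines2 = ( "chr1:1736464585\tT/G\n", "chr2:42\tA/C\n" );

my @found;
for my $line2 (@lines2) {
    next unless $line2 =~ /^chr/;
    my @split2 = split /\t/, $line2;
    my $key    = $split2[0];

    # exists takes a single hash element, written with the $ sigil.
    push @found, $key if exists $hash1{$key};
}

print "$_\n" for @found;    # chr1:1736464585
```

exists only checks that the key is present; it is true even when the stored value is undef or an empty string.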

Nested if statements: Swapping headers and sequences in fasta files

I am opening a directory and processing each file. A sample file looks like this when opened:
>AAAAA
TTTTTTTTTTTAAAAATTTTTTTTTT
>BBBBB
TTTTTTTTTTTTTTTTTTBBBBBTTT
>CCCCC
TTTTTTTTTTTTTTTTCCCCCTTTTT
For the above sample file, I am trying to make them look like this:
>TAAAAAT
AAAAA
>TBBBBBT
BBBBB
>TCCCCCT
CCCCC
I need to find the "header" in the sequence on the next line, take one flanking character on either side of the match, and then swap the two lines. I want to print each file's worth of contents to another separate file.
Here is my code so far. It runs without errors, but doesn't generate any output. My guess is this is probably related to the nested if statements. I have never worked with those before.
#!/usr/bin/perl
use strict;
use warnings;
my ($directory) = @ARGV;
my $dir = "$directory";
my @ArrayofFiles = glob "$dir/*";
my $count = 0;
open(OUT, ">", "/path/to/output_$count.txt") or die $!;
foreach my $file(@ArrayofFiles){
open(my $fastas, $file) or die $!;
while (my $line = <$fastas>){
$count++;
if ($line =~ m/(^>)([a-z]{5})/i){
my $header = $2;
if ($line !~ /^>/){
my $sequence .= $line;
if ($sequence =~ m/(([a-z]{1})($header)([a-z]{1}))/i){
my $matchplusflanks = $1;
print OUT ">", $matchplusflanks, "\n", $header, "\n";
}
}
}
}
}
How can I fix this code? Thanks.
Try this
foreach my $file(@ArrayofFiles)
{
open my $fh, "<", $file or die "error opening $file: $!\n";
while(my $head=<$fh>)
{
chomp $head;
$head=~s/>//;
my $next_line = <$fh>;
my($extract) = $next_line =~m/(.$head.)/;
print ">$extract\n$head\n";
}
}
There are several mistakes in your code but the main problem is:
if ($line =~ m/(^>)([a-z]{5})/i) {
my $header = $2;
if ($line !~ /^>/) {
# here you write to the output file
Because the same line can't start and not start with > at the same time, your output files are never written. The second if statement always fails and its block is never executed.
open(OUT, ">", "/path/to/output_$count.txt") or die $!; and $count++ are misplaced. Since you want to produce an output file (with a new name) for each input file, you need to put them in the foreach block, not outside or in the while loop.
Example:
#!/usr/bin/perl
use strict;
use warnings;
my ($dir) = @ARGV;
my @files = glob "$dir/*";
my $count;
my $format = ">%s\n%s\n";
foreach my $file (@files) {
open my $fhi, '<', $file
or die "Can't open file '$file': $!";
$count++;
my $output_path = "/path/to/output_$count.txt";
open my $fho, '>', $output_path
or die "Can't open file '$output_path': $!";
my ($header, $seq);
while(<$fhi>) {
chomp;
if (/^>([a-z]{5})/i) {
if ($seq) { printf $fho $format, $seq =~ /([a-z]$header[a-z])/i, $header; }
($header, $seq) = ($1, '');
} else { $seq .= $_; }
}
if ($seq) { printf $fho $format, $seq =~ /([a-z]$header[a-z])/i, $header; }
# Close both handles inside the loop, where $fhi and $fho are in scope.
close $fhi;
close $fho;
}

Is there a better way of writing this code to avoid redundancy?

I have a segment of code in a program, which accepts GNU style input from a pipe (which is a list of file names). If STDIN does not contain data, I need to accept input from a predetermined text file containing file names.
I find myself needing to write redundant code. Is it possible to simplify this bit of code to avoid redundancy?
sub downloadlinkgen {
my $fh;
print "Printing links\n";
if ($getfilelist==1) {
open $fh, '<', "fuzzyfile" or die $!;
while (<$fh>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
} else {
while (<>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
}
}
Yes, just make the default ARGV filehandle open the file:
sub downloadlinkgen {
@ARGV = 'fuzzyfile' if $getfilelist == 1;
print "Printing links\n";
while (<>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
}
sub downloadlinkgen {
# default file handle
my $fh = \*ARGV;
print "Printing links\n";
if ($getfilelist==1) {
open $fh, '<', "fuzzyfile" or die $!;
}
while (<$fh>) {
chomp ($_);
(my $fname,my $path, my $suffix) = fileparse($_);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}
}
From perldoc -f readline
Reads from the filehandle whose typeglob is contained in EXPR (or from *ARGV if EXPR is not provided)
So \*ARGV is a reference to the filehandle that <> reads from, and you can use $fh in both cases.
Even if you don't know about ARGV, you could do something simple like this:
sub downloadlinkgen {
my $fh;
print "Printing links\n";
if ($getfilelist==1) {
open $fh, '<', "fuzzyfile" or die $!;
while (<$fh>) {
process_line($_);
}
} else {
while (<>) {
process_line($_);
}
}
}
sub process_line {
my $line = shift;
chomp ($line);
(my $fname,my $path, my $suffix) = fileparse($line);
my ($name, $ext) = $fname =~ /(.*)\.(.*)/;
my $newfile=$path.$name.".$ext";
$newfile =~ s/\s/%20/g;
$newfile =~ s/\/root/http:\/\/myip/;
print $newfile."\n";
}

Add counter to if statement

How can I add a counter to this statement?
# go through each reference file
for my $file (@reference_files)
{
open my $ref, "<", $file or die "Can't open reference file '$file': $!";
while (my $line = <$ref>)
{
chomp $line;
my ($scaffold, undef, $type, $org_snp, $new_snp, undef, undef, undef, $info) = split /\t/, $line;
next if not $scaffold =~ /^KB/;
next if not $type =~ /^GENE/i;
my ($transcript_id, $gene_name, $auto) = split /[;][ ]/, $info;
$gene_name = $1 if $gene_name =~ /["]([^"]*)["]/;
if (my $matching_genes = $genes{$scaffold})
{
say join "\t", $gene_name, $_ for values %$matching_genes;
}
}
say "###";
}
I would like the script to additionally count all $matching_genes. Is there a way to incorporate this? I've been unsuccessful with standard counters (i.e. $i++) as it's pulling all values in the hash.
You can declare a global counter variable, initialized to 0, before your for loop:
my $counter = 0;
# go through each reference file
for my $file (@reference_files)
# ... Rest of your code ...
Then, you can increment $counter inside of the if statement where $matching_genes is assigned:
if (my $matching_genes = $genes{$scaffold})
{
$counter++;
say join "\t", $gene_name, $_ for values %$matching_genes;
}
my $count=0;
# go through each reference file
for my $file (@reference_files)
{
open my $ref, "<", $file or die "Can't open reference file '$file': $!";
while (my $line = <$ref>)
{
chomp $line;
my ($scaffold, undef, $type, $org_snp, $new_snp, undef, undef, undef, $info) = split /\t/, $line;
next if not $scaffold =~ /^KB/;
next if not $type =~ /^GENE/i;
my ($transcript_id, $gene_name, $auto) = split /[;][ ]/, $info;
$gene_name = $1 if $gene_name =~ /["]([^"]*)["]/;
if (my $matching_genes = $genes{$scaffold})
{
say join "\t", $gene_name, $_ for values %$matching_genes;
$count += scalar(keys %$matching_genes);
}
}
say "###";
}
print "total: $count\n";
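One detail worth checking in a counter like this is the operator. += adds to the running total, whereas =+ is parsed as a plain assignment of a unary-plus value, so the total would only ever hold the last count. The following sketch uses made-up numbers and variable names to show the difference; the assignment is written as = + explicitly so that the example runs without warnings.

```perl
use strict;
use warnings;

my ($assigned, $summed) = (0, 0);
for my $n (2, 3, 4) {
    $assigned = +$n;    # what "$assigned =+ $n" means: plain assignment
    $summed  += $n;     # adds $n to the running total
}

print "assigned: $assigned, summed: $summed\n";    # assigned: 4, summed: 9
```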