I am trying to make table-like output from one input file, but it constantly gives me an error:
Use of uninitialized value $ac[3] in concatenation (.) or string at table.pl
line 58 (#1)
and
Use of uninitialized value $or[2] in concatenation (.) or string at table.pl
line 61 (#1)
and although I have made almost every possible change, it still gives me an error and does not print well.
This is what my input file looks like:
HEADER OXIDOREDUCTASE 08-JUN-12 2LU5
EXPDTA SOLID-STATE NMR
REMARK 2 RESOLUTION. NOT APPLICABLE.
HETNAM CU COPPER (II) ION
HETNAM ZN ZINC
FORMUL 2 CU CU 2+
FORMUL 2 ZN ZN 2+
END
This is a script I am using:
#!/usr/bin/env perl
use strict;
use warnings;
use diagnostics;
#my $testfile=shift;
open(INPUT, "$ARGV[0]") or die 'Cannot make it';
my @file=<INPUT>;
close INPUT;
my @ac=();
my @dr=();
my @os=();
my @or=();
my @fo=();
for (my $line=0;$line<=$#file;$line++)
{
    chomp($file[$line]);
    if ($file[$line] =~ /^HEADER/)
    {
        print( (split '\s+', $file[$line])[-1]);
        print "\t";
        while ($file[$line] !~ /^END /)
        {
            $line++;
            if ($file[$line]=~/^EXPDTA/)
            {
                $file[$line]=~s/^EXPDTA//;
                @os=(@os,split '\s+', $file[$line]);
            }
            if ($file[$line] =~ /^REMARK 2 RESOLUTION./)
            {
                $file[$line]=~s/^REMARK 2 RESOLUTION.//;
                @ac = (@ac,split'\s+',$file[$line]);
            }
            if ($file[$line] =~ /^HETNAM/)
            {
                $file[$line]=~s/^HETNAM//;
                $file[$line] =~ s/\s+//;
                push @dr, $file[$line];
            }
            if ($file[$line] =~ /^SOURCE 2 ORGANISM_SCIENTIFIC/)
            {
                $file[$line]=~s/^SOURCE 2 ORGANISM_SCIENTIFIC//;
                @or = (@or,split'\s+',$file[$line]);
            }
            if ($file[$line] =~ /^FORMUL/)
            {
                $file[$line]=~s/^FORMUL//;
                $file[$line] =~ s/\s+//;
                push @fo, $file[$line];
            }
        }
        print "$os[1] $os[2]\t";
        print "\t";
        @os=();
        print "$ac[3] $ac[4]\t" or die "Cannot be printed"; #line 58
        print "\t";
        @ac=();
        print "$or[2] $or[3]\t" or die "Cannot be printed"; #line 61
        print "\t";
        @or=();
        foreach (@dr)
        {
            print "$_";
            print "\t\t\t\t\t";
        }
        @dr=();
        print "\n";
    }
}
And this is the output it gives me; it doesn't seem to print well and I am really not sure why:
2LU5 SOLID-STATE NMR CU COPPER (II) ION
The desired output that I am expecting is:
HEADER EXPDTA REMARK HETNAM FORMUL
OXIDOREDUCTASE 2LU5 SOLID-STATE NMR RESOLUTION. NOT APPLICABLE. COPPER (II) ION (better to say the last part of this column, since what comes before "COPPER" varies) CU 2+
ZN ZINC ZN 2+
The root of your error is that:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @ac = ();
my $str = "REMARK 2 RESOLUTION. NOT APPLICABLE. ";
$str =~ s/^REMARK 2 RESOLUTION.//;
@ac = ( @ac, split '\s+', $str );
print Dumper \@ac;
The contents of @ac is:
$VAR1 = [
'',
'NOT',
'APPLICABLE.'
];
There is no $ac[3]; you only have elements 0, 1, and 2 in there.
With your @or error, you don't have any lines matching /^SOURCE 2 ORGANISM_SCIENTIFIC/.
So that array is empty, and that too means you've got no $or[2] to print.
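If you just want to silence the warnings while you investigate, one option (a small sketch of my own, not part of your original script) is to fall back to an empty string with the defined-or operator (Perl 5.10+) before printing:
#!/usr/bin/env perl
use strict;
use warnings;

my @ac = ('', 'NOT', 'APPLICABLE.');   # what your split actually produced
my @or;                                # stays empty: no SOURCE line ever matched

# '//' substitutes '' when the element is missing, so the string
# concatenation no longer warns about uninitialized values
print( ($ac[3] // '') . ' ' . ($ac[4] // '') . "\t" );
print( ($or[2] // '') . ' ' . ($or[3] // '') . "\t" );
print "\n";
That only hides the symptom, though; the indexes are still pointing at data that isn't there.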
More generally - what you're doing here is actually really quite clunky, and there's a much cleaner solution.
How about:
#!/usr/bin/env perl
use strict;
use warnings;
#set the text "END" as our record separator
local $/ = 'END';
#define the fields to print out.
my @field_order = qw ( HEADER EXPDTA REMARK HETNAM FORMUL );
print join ( ",", @field_order), "\n"; #print header row
#iterate STDIN or file named on command line.
#just like you're doing with open (FILE, $ARGV[0])
while ( <> ) {
    #select key value pairs into a hash - first word on the line is the 'key'
    #and the value is 'anything else'.
    my %this_entry = m/^(\w+)\s+(.*)$/gm;
    next unless $this_entry{'HEADER'}; #check we have a header.
    s/\s+/ /g for values %this_entry; #strip repeated spaces from fields;
    s/\s+$//g for values %this_entry; #strip trailing whitespace.
    #split 'header' row into separate subfields
    #this is an example of how you could transform other fields.
    ( $this_entry{'HEADER'}, $this_entry{'DATE'}, $this_entry{'STRUCT'} ) = split ' ', $this_entry{'HEADER'};
    print join ( ",", @this_entry{@field_order} ), "\n";
}
This will - given your input - print:
HEADER,DATE,STRUCT,EXPDTA,REMARK,HETNAM,FORMUL
OXIDOREDUCTASE,08-JUN-12,2LU5,SOLID-STATE NMR,2 RESOLUTION. NOT APPLICABLE.,CU COPPER (II) ION,2 CU CU 2+
That isn't quite the output you asked for, but hopefully it illustrates how much simpler this task could be.
My mind seems to be missing a few screws today. I have an issue that I'm baffled by, but to be fair, I'm new to Perl scripting.
I am opening a csv file and need to look for duplicate values in one column, and where there are duplicates in this column, I need to add all values from another column for each duplicate together and print it on a new line in a new file.
open(my $feed, '<', $rawFile) or die "Could not locate '$rawFile'\n";
open(OUTPUT, '>', $newFile) or die "Could not locate '$newFile'\n";
while(my $line = <$feed>) {
    chomp $line;
    my @columns = split /,/, $line;
    $Address= $columns[1];
    $forSale= $columns[3];
}
I understand how to open the file and read it line by line. I know how to print results to a new file. What I'm having trouble with is building logic to say, "For each Address in this extract that has duplicates, add all of their forSale values up and print the Address in the new file with the summed forSale value." I hope this makes sense. Any assistance at all is appreciated.
The tool you need for this job is a hash.
This will allow you to 'key' things by Address:
my %sum_of;
while(my $line = <$feed>) {
    chomp $line;
    my @columns = split /,/, $line;
    $Address= $columns[1];
    $forSale= $columns[3];
    $sum_of{$Address} += $forSale;
}
foreach my $address ( sort keys %sum_of ) {
    print "$address => $sum_of{$address}\n";
}
Hello Chris Simmons,
I would like to add a few minor modifications to the answer that Sobrique provided.
You can open a file the way you did, but you can also process multiple files given on the command line, e.g. test.pl sample1.csv sample2.csv; you can read about this in the documentation for eof.
I would also check whether each line contains a comma character (,), and otherwise warn on the terminal that the line cannot be parsed.
As a next step, after splitting the values into an array, I would trim leading and trailing whitespace from each string.
Having said all that, see the solution below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %hash;
while (<>) {
    chomp;
    if (index($_, ',') != -1) {
        my @fields = split(/,/);
        # remove leading and trailing white space
        s{^\s+|\s+$}{}g foreach @fields;
        $hash{$fields[0]} += $fields[3];
    }
    else {
        warn "Line could not be parsed: $_\n";
    }
} continue {
    close ARGV if eof;
}
print Dumper \%hash;
__END__
$ perl test.pl sample.csv
$VAR1 = {
'123 6th St.' => 3,
'71 Pilgrim Avenue' => 5
};
__DATA__
123 6th St., Melbourne, FL 32904, 2
71 Pilgrim Avenue, Chevy Chase, MD 20815, 5
123 6th St., Melbourne, CT 06074, 1
Since you did not provide a sample of your input data, I created my own.
Another possible way is to use the module Text::CSV, as ikegami proposed. A sample of code with the same checks that I mentioned earlier is below:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;
my $csv = Text::CSV->new({ sep_char => ',' });
my %hash;
while (<>) {
    chomp;
    if ($csv->parse($_)) {
        my @fields = $csv->fields();
        # remove leading and trailing white space
        s{^\s+|\s+$}{}g foreach @fields;
        $hash{$fields[0]} += $fields[3];
    } else {
        warn "Line could not be parsed: $_\n";
    }
} continue {
    close ARGV if eof;
}
print Dumper \%hash;
__END__
$ perl test.pl sample.csv
$VAR1 = {
'123 6th St.' => 3,
'71 Pilgrim Avenue' => 5
};
__DATA__
123 6th St., Melbourne, FL 32904, 2
71 Pilgrim Avenue, Chevy Chase, MD 20815, 5
123 6th St., Melbourne, CT 06074, 1
Hope this helps.
BR / Thanos
The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this, there was a percentage symbol mixed in with the numbers. I solved this by substituting the value like so: s/%/ %/g for @data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
#ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an # or $
in a double-quoted string immediately before interpolating a variable,
for example in "user #$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal #
symbol followed by the contents of $twitter_id: "user \#$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my @data;
while ( <$DATA> ) {
    @data = (<$DATA>);
}
s/%/ %/g for @data;
s/---/000/ for @data;
print @data;
my @sorted = sort { $b->[5] <=> $a->[5] } @data;
print @sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
    my $keya = $a =~ /(\d+)%/ ? $1 : '0';
    my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
    $keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6, it would have created 7 columns, because the last field, 'Mounted on', also contains a space and would have been incorrectly split into 2 columns, for a total of 7.
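To see the difference the limit makes, here is a tiny standalone snippet (my own illustration, using a header line like the one in the question):
#!/usr/bin/env perl
use strict;
use warnings;

my $hdr = 'Filesystem  total  used  avail  capacity  Mounted on';

my @no_limit = split ' ', $hdr;      # 'Mounted' and 'on' become separate fields
my @limit6   = split ' ', $hdr, 6;   # the 6th field keeps 'Mounted on' together

print scalar @no_limit, " fields without a limit\n";   # 7
print scalar @limit6,   " fields with a limit of 6\n"; # 6
print "last field: '$limit6[-1]'\n";                   # Mounted on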
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above 'loads' the data into the table. The map applies a transformation to each line from <$DATA>: each line is split into an anonymous array (created by [....]). The split is on 2 or more spaces, \s{2,}. If that weren't specified, the data 'snap reserve', which contains a single space, would have been incorrectly split.
I hope this makes what's going on clearer.
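As a further illustration, here is what the map [ split /\s{2,}/ ] step does to one made-up data line (again just a sketch, not part of the original program):
#!/usr/bin/env perl
use strict;
use warnings;

my @lines = ("snap reserve    0TB    0TB    0TB    ---%   /vol/vfiler_PROD1_SF_isci_15K01/..");

# each line becomes a reference to an anonymous array of its fields;
# splitting on two-or-more spaces keeps 'snap reserve' together as one field
my @rows = map [ split /\s{2,}/ ], @lines;

print "first field:  '$rows[0][0]'\n";   # snap reserve
print "second field: '$rows[0][1]'\n";   # 0TB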
And a simpler example that doesn't align the columns like Text::Table, but leaves them in the form in which they were originally read, might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
    my $keya = $a =~ /(\d+)%/ ? $1 : '0';
    my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
    $keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my @sorted = sort { $b->[5] <=> $a->[5] } @data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array @data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that.
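Given the sample data shown in the question, one way it could look (a sketch of my own, not necessarily what the answerer had in mind) is to pull the number in front of '%' out of each line inside the sort block:
#!/usr/bin/env perl
use strict;
use warnings;

my @data = (
    "/vol/vfiler_PROD1_SF_NFS15K01/  1638GB  735GB  903GB  45%  /vol/vfiler_PROD1_SF_NFS15K01/\n",
    "/vol/vfiler_PROD1_SF_isci_15K01/  2048GB  1653GB  394GB  81%  /vol/vfiler_PROD1_SF_isci_15K01/\n",
    "snap reserve  0TB  0TB  0TB  ---%  /vol/vfiler_PROD1_SF_isci_15K01/..\n",
);

# compare the capacity percentages, treating the ---% row (no digits)
# as zero, and sort in descending order
my @sorted = sort {
    my ($pa) = $a =~ /(\d+)%/;
    my ($pb) = $b =~ /(\d+)%/;
    ( $pb // 0 ) <=> ( $pa // 0 );
} @data;

print @sorted;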
Your code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my @data;
while( my $d = <$fh> ) {
    chomp $d;
    my @fields = split '\s{2,}', $d;
    if( scalar @fields > 4 ) {
        $fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
        push @data, [ @fields ];
    }
}
foreach my $i ( @data ) {
    print join("\t", @$i), "\n";
}
my @sorted = sort { $b->[4] <=> $a->[4] } @data;
foreach my $i ( @sorted ) {
    $i->[4] .= '%';
    print join("\t", @$i), "\n";
}
close($fh);
Let's make a few things clear:
If using the $ notation, it is customary to define file variables in lower case, such as $fd; it is also typical to name the file descriptor "fd".
You define but do not use the first three variables. If you don't apply chomp to them, the final newline will remain in them. I have not done so, as they are not used.
You were building a list with a line in each element, but you need a list ref inside each element to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to list @data using push and turning the list to add into a list ref with [ @list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.
I'm trying to find the number of positive integers (P) and negative integers (N), the number of words with all lower-case characters (L), the number with all upper-case characters (F), and the number of words with the first character capitalized and the rest lower case (U).
I also need a list of words in alphabetical order, together with the line number and the filename of each occurrence. The following example illustrates the output of the program on sample input.
file1
Hello! world my friend. ALI went to school. Ali has -1 dollars and 10 TL
file2
Hello there my friend. VELI went to school. Veli has 10,
dollars and -10,TL
After you run your program,
>prog.pl file1 file2
the output you get is as follows:
N=2
P=2
L=18
F=4
U=4
-----------
ali file1 (1 1)
and file1 (2) file2 (2)
dollars file1 (2) file2 (2)
friend file1 (1) file2 (1)
has file1 (1) file2 (1)
hello file1 (1) file2 (1)
my file1 (1) file2 (1)
school file1 (1) file2 (1)
there file2 (1)
tl file1 (2) file2 (2)
to file1 (1) file2 (1)
veli file2 (1 1)
went file1 (1) file2 (1)
world file1 (1)
I tried to fill in the blanks; could you help me deal with it?
#!/usr/bin/perl
$N= 0 ;
$P= 0 ;
$L= 0 ;
$F= 0 ;
$U= 0 ;
foreach __________ ( ____________) {__________________
        or die("Cannot opened because: $!") ;
    $lineno = 0 ;
    while($line=<>) {
        chomp ;
        $lineno++ ;
        @tokens = split $line=~ (/[ ,.:;!\?]+/) ;
        foreach $str (@tokens) {
            $N++ if ($str =~ /^-\d+$/) ;
            $P++ if ($str =~ /^\d+$/) ;
            $L++ if ($str =~ /^[a-z]+$/) ;
            $F++ if ($str =~ /^[A-Z][a-z]+$/) ;
            $U++ if ($str =~ /^[A-Z]+$/) ;
            if ($str =~ /^[a-zA-Z]+$/) {
                $str =~ __________________;
                if ( (____________________) || ($words{$str} =~ /\)$/ ) ) {
                    $words{$str} = $words{$str} . " " . $file . " (" . $lineno ;
                }
                else {_______________________________________;
                }
            }
        }
    }
    close(FH) ;
    foreach $w (__________________) {
        if ( ! ($words{$w} =~ /\)$/ )) {
            $words{$w} = ______________________;
        }
    }
}
print "N=$N\n" ;
print "P=$P\n" ;
print "L=$L\n" ;
print "F=$F\n" ;
print "U=$U\n" ;
print "-----------\n" ;
foreach $w (sort(keys(%words))) {
print $w," ", $words{$w}, "\n";
}
A few hints, and I'll let you get on your way...
Perl has what is called the diamond operator. This operator opens all files placed on the command line (which are read into the @ARGV array), and reads them line by line.
use strict;
use warnings;
use autodie;
use feature qw(say);
while ( my $line = <> ) {
    chomp $line;
    say "The line read in is '$line'";
}
Try this program and run it as you would your program. See what happens.
Next, take a look at the Perl documentation for variables related to file handles. Especially take a look at the $/ variable. This variable is what is used to break records. It's normally set to a newline, so when you read in a file, you read it line by line. You may want to experiment with that (there is a small illustration of $/ after the next example). If not, you can fall back onto something like this:
use strict;
use warnings;
use autodie;
use feature qw(say);
while ( my $line = <> ) {
    chomp $line;
    my @words = split /\s+/, $line;
    for my $word ( @words ) {
        say "The word is '$word'";
    }
}
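For reference, here is a tiny standalone illustration of the $/ variable mentioned above (my own sketch, not part of the original answer): clearing it makes the diamond operator return a whole file as one record instead of one line at a time.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);

# read the whole input in one go by clearing the record separator;
# 'local' restores the normal line-by-line behaviour afterwards
my $whole_file = do {
    local $/;    # undef: no record separator, so one read returns everything
    <>;
} // '';
say "read ", length($whole_file), " characters as a single record";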
Now you can use a hash to track which words were in each file and how many times. You can also track the various types of words you've mentioned. However, please don't use variables such as $U. Use $first_letter_uppercase. This will have more meaning in your program and will be less confusing for you.
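In case the hash idea isn't obvious, here is a tiny sketch of one shape it could take (my own illustration with made-up data; the complete program further down does the same thing properly):
#!/usr/bin/env perl
use strict;
use warnings;

# hash of hashes of arrays: word => file name => list of line numbers
my %words;

# hypothetical observations of (word, file, line number)
my @seen = ( [ 'hello', 'file1', 1 ], [ 'ali', 'file1', 1 ], [ 'hello', 'file2', 1 ] );

for my $obs (@seen) {
    my ( $word, $file, $lineno ) = @$obs;
    push @{ $words{$word}{$file} }, $lineno;
}

for my $word ( sort keys %words ) {
    print "$word ";
    for my $file ( sort keys %{ $words{$word} } ) {
        print "$file (@{ $words{$word}{$file} }) ";
    }
    print "\n";
}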
Your teacher is teaching you the way Perl was written almost 30 years ago. This was back before God created the Internet. (Well, not quite. The Internet was already 10 years old, but no one outside of a few academics had heard of it). Perl programming has greatly evolved since then. Get yourself a good book on Modern Perl (that is Perl 5.x).
The pragmas at the beginning of my program (the use statements) do the following:
use strict - Use strict syntax. This does several things, but the main thing is to make sure you cannot use a variable unless you first declare it. (using most likely my). This prevents mistakes such as putting $name in one place, and referring to $Name in another place.
use warnings - This warns you of basic errors, such as attempting to use a variable that hasn't been given a value. By default, Perl assumes such a variable is a null string, or equal to zero if you use it in an arithmetic context. When you attempt to print or check a variable that hasn't been assigned a value, it probably means you have a logic mistake.
The above two pragmas will catch 90% of your errors.
use autodie - This will cause your program to automatically die in many circumstances, for example when you attempt to open a nonexistent file for reading. This way, you don't have to remember to check whether each of these operations succeeded or failed.
use feature qw(say) - This allows you to use say instead of print. The say command is just like print, but automatically adds a new line on the end. It can make your code way cleaner and easier to understand.
For example:
print "N=$N\n" ;
vs.
say "N=$N" ;
Here's how I'd write that program. But it won't get you many marks as it's a long way from the "fill in the blanks" approach that your teacher is using. But that's good, because your teacher's Perl is very dated.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my ($N, $P, $L, $F, $U);
my %words;
while (<>) {
    my @tokens = split /[^-\w]+/;
    foreach my $token (@tokens) {
        $N++ if $token =~ /^-\d+$/;
        $P++ if $token =~ /^\d+$/;
        next unless $token =~ /[a-z]/i;
        $L++ if $token eq lc $token;
        $U++ if $token eq uc $token;
        $F++ if $token eq ucfirst lc $token;
        push @{$words{lc $token}{$ARGV}}, $.;
    }
    close ARGV if eof;
}
say "N=$N";
say "P=$P";
say "L=$L";
say "F=$F";
say "U=$U";
for my $word (sort { $a cmp $b } keys %words) {
    print "$word ";
    for my $file (sort { $a cmp $b } keys %{$words{$word}} ) {
        print "$file (", join(' ', @{$words{$word}{$file}}), ') ';
    }
    print "\n";
}
I have a text file that has approximately 3,000 lines. 99% of the time I need all 3,000 lines. However, periodically I will grep out the lines I need and direct the output to another text file to use.
The only problem I have in doing so is that embedded in the text file is a 6-character string of numbers that indicates the line number. In order to use the file, this area needs to be correctly renumbered. (I don't need to re-sort the data, but I need to replace the current six characters with the new line number, and it must be padded with zeros!) Unfortunately, each row is one long string of data with no field separators!
For example, my first three rows might look something like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
The six characters at positions 17-22 (immediately following the "ZZ") need to be renumbered based on the current row number, so the above needs to look like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
Any ideas would be greatly appreciated!
Thanks,
KSL.
Here's the solution I came up with in Perl. It assumes that the numbering is always 6 digits after the ZZ sequence.
In convert.pl:
use strict;
use warnings;
my $i = 1; # or the value you want to start numbering
while (<STDIN>) {
    my $replace = sprintf("%06d", $i++);
    $_ =~ s/ZZ\d{6}/ZZ$replace/g;
    print $_;
}
In data.dat:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
To run:
cat data.dat | perl convert.pl
Output
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
If I were to solve this, I would create a simple Python script that reads those lines, filters as grep does, and keeps an internal counter inside the script.
As a simple hint, you can read each line into a string and take the relevant part with a slice such as variablename[16:22] (the zero-based slice covering character positions 17-22).
Python strings also have a replace method; just replace that value with the counter you maintain.
I hope this helps.
To do this in awk:
awk '{print substr($0,1,16) sprintf("%06d", NR) substr($0,23)}'
or
gawk 'match($0,/^(.*ZZ)[0-9]{6}(.*)/,a) {print a[1] sprintf("%06d",NR) a[2]}'
This is exactly the type of thing where unpack is useful.
#!/usr/bin/env perl
use v5.10.0;
use strict;
use warnings;
while( my $line = <> ){
    chomp $line;
    my @elem = unpack 'A16 A6 A*', $line;
    $elem[1] = sprintf '%06d', $.;
    # $. is the line number for the last used file handle
    say @elem;
}
Actually looking at the lines, it looks like there is date information stored in the first 14 characters.
Assuming that at some point you might want to parse the lines further, you can use the following as an example of how unpack could split them up.
#!/usr/bin/env perl
use v5.10.0; # say()
use strict;
use warnings;
use DateTime;
my @date_elem = qw'
    year month day
    hour minute second
';
my @elem_names = ( @date_elem, qw'
    ZZ
    line_number
    random_data
');

while( my $line = <> ){
    chomp $line;

    my %data;
    @data{ @elem_names } = unpack 'A4 (A2)6 A6 A*', $line;

    # choose either this:
    $data{line_number} = sprintf '%06d', $.;
    say @data{@elem_names};

    # or this:
    $data{line_number} = $.;
    printf '%04d' . ('%02d'x5) . "%2s%06d%s\n", @data{ @elem_names };

    # the choice will affect the contents of %data
    # this just shows the contents of %data
    for( @elem_names ){
        printf qq'%12s: "%s"\n', $_, $data{$_};
    }

    # you can create a DateTime object with the date elements
    my $dt = DateTime->new(
        (map{ $_, $data{$_} } @date_elem),
        time_zone => 'floating',
    );
    say $dt;

    print "\n";
}
Although it would be better to use a regular expression, so that you could throw out bogus data.
use v5.14; # /a modifier
...
my $rdate = join '', map{"(\\d{$_})"} 4, (2)x5;
my $rx = qr{$rdate (ZZ) (\d{6}) (.*)}xa;

while( my $line = <> ){
    chomp $line;

    my %data;
    unless( @data{ @elem_names } = $line =~ $rx ){
        die qq'unable to parse line "$line" ($.)';
    }
    ...
It would be better still to use named capture groups, added in 5.10.
...
my $rx = qr'
    (?<year>   \d{4} ) (?<month>  \d{2} ) (?<day>    \d{2} )
    (?<hour>   \d{2} ) (?<minute> \d{2} ) (?<second> \d{2} )
    ZZ
    (?<line_number> \d{6} )
    (?<random_data> .* )
'xa;

while( my $line = <> ){
    chomp $line;

    unless( $line =~ $rx ){
        die qq'unable to parse line "$line" ($.)';
    }
    my %data = %+;

    # for compatibility with previous examples
    $data{ZZ} = 'ZZ';
    ...
Another question for everyone. To reiterate, I am very new to Perl, and I apologize in advance for making silly mistakes.
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counters and can read each line of DNA sequence, but at the moment it is doing a running summation of the totals across all lines. I want it to read each sequence, print the counts after that sequence is read, and then move on to the next one, keeping individual base counts for each sequence.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array= (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percentage values are wrong. I feel like the total is being calculated and then that new value is incorporated back into the totals.
Several things:
1. Use a hash instead of declaring each element.
2. An assignment such as $G = (0); does work, but it is not the right way to assign a scalar. What you wrote is a list, which in the scalar context of $G = returns its first item. The correct way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
    if (/^>/) {
        # print counts for the prev gene if there are counts:
        print_counts(\%counts) if keys %counts;
        %counts = (); # reset the counts
        print $_;     # print the Fasta header
    } else {
        chomp;
        $counts{$_}++ for split //;
    }
}
print_counts(\%counts) if keys %counts; # print counts for last gene

sub print_counts {
    my ($counts) = @_;
    print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put a closing brace on the same line as another statement, like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach loops, like for my $base (@array) { ... }.