For Every Line in File Sum Values - perl

I am trying to make a Perl script that goes through and loops through a file, does some Regex to get number that are surrounded by parenthesis, add them up for every line.
#file
Awaiting_Parts_Bin(2),Inspection_Bin(1),Picked-1-3888(1),Picked-1-4364(2)
Picked-1-3890(1)
Picked-1-4364(1)
Picked-1-3888(4),Picked-1-3890(2),Picked-1-4364(1),Picked-1-7202(1)
Awaiting_Parts_Bin(1)
Desired Output
#new_file
6
1
1
8
1
Perl script
#!/usr/bin/perl
use strict;
use warnings;
my $file = '/Users/.....csv';
my $new_file = '/Users/.....csv';
open(my $fh, '<', $file)
or die "cannot open file";
open(my $new_fh, '>', $new_file)
or die "cannot open file";
my $sum = 0;
while (my $line = <$fh>){
my #arr = ( $line =~ /\(([0-9]+)\)/g);
foreach my $val ( #arr ) {
$sum += $val;
print $sum, "\n";
# this makes sense that it is resetting to zero while looping. This is just one variation I tried. I tried putting the sum=0 outside the loop and it made it a running total
$sum = 0;
}
}
No matter what I try I can't get it right. This code doesn't add all of the values it takes the last one in the file. So the output would look like this
#output now
2
1
1
1
1
Or if I remove the $sum=0 in the for loop then it makes it a running total.

You need to reset $sum to zero at the start of the outer (per line) while loop, not in the inner (per element) foreach loop.
This can be achieved by putting the declaration and initial assignment of $sum inside the while loop:
while (my $line = <$fh>) {
my $sum = 0;
my #arr = ( $line =~ /\(([0-9]+)\)/g);
foreach my $val ( #arr ) {
$sum += $val;
}
print $sum, "\n";
}

Like this?
use strict;
use warnings 'all';
use List::Util 'sum';
while ( <> ) {
my $sum = sum /\((\d+)\)/g;
print "$sum\n" if defined $sum;
}
output
6
1
1
8
1

Related

Perl simple filehandling of text

What this program is meant to do is that it reads a text file which looks like:
Item \t\t Price
apple \t\t 20
orange \t\t 50
lime \t\t 30
I'm using split function to split these 2 columns and then i should apply a -25% discount on all items and print it out to a new file. My code so far does what i want but the new text file has a '0' value under my last number in price column. I also get 2 errors if i run it with "use warnings" which are:
Use of uninitialized value $item in multiplication * ...
Use of uninitialized value $item[0] in concatenation (.) ...
I should also tell total number of items calculated but i get like 5 1's instead of 5. (11111 instead of 5)
use strict;
use warnings;
my $filename = 'shop.txt';
if (-e $filename){
open (IN, $filename);
}
else{
die "Can't open input file for reading: $!";
}
open (OUT,">","discount.txt") or die "Can't open output file for writing: $!";
my $header = <IN>;
print OUT $header;
while (<IN>) {
chomp;
my #items = split(/\t\t/);
foreach my $item ($items[1]){
my $discount = $item * (0.75);
print OUT "$items[0]\t\t$discount\n";
}
}
This is too complicated and not clear what are you doing in foreach loop and you are not skipping empty lines. Keep it simple:
use warnings;
use strict;
use v5.10;
<>; # skip header
while(my $line = <>)
{
chomp $line;
next unless ($line);
my ($title, $price ) = split /\s+/, $line;
if( $title && defined $price )
{
$price *= 0.75;
say "$title\t\t$price";
}
}
and run like
perl script.pl <input.txt >output.txt
use strict;
use warnings;
my $filename = 'shop.txt';
if (-e $filename){
open (IN, $filename);
}
else{
die "Can't open input file for reading: $!";
}
open (OUT,">","discount.txt") or die "Can't open output file for writing: $!";
my $header = <IN>;
my $item;
my $price;
print OUT $header;
while (<IN>) {
chomp;
($item, $price) = split(/\t\t/);
my $discount = $price*0.75;
print OUT "$item $discount\n";
}
This should help! :)
If the total item count isn't very important to you:
$ perl -wane '$F[1] *= 0.75 if $. > 1; print join("\t", #F), "\n";' input.txt
Output:
Item Price
apple 15
orange 37.5
lime 22.5
If you really need the total item count:
$ perl -we 'while (<>) { #F = split; if ($. > 1) { $F[1] *= 0.75; $i++ } print join("\t", #F), "\n"; } print "$i items\n";' input.txt
Output:
Item Price
apple 15
orange 37.5
lime 22.5
3 items
I'd use this approach
#!/usr/bin/perl
use strict;
use warnings;
my %items;
my $filename = 'shop.txt';
my $discount = 'discount.txt';
open my $in, '<', $filename or die "Failed to open file! : $!\n";
open my $out, ">", $discount or die "Can't open output file for writing: $!";
print $out "Item\t\tPrice\n";
my $cnt = 0;
while (my $line = <$in>) {
chomp $line;
if (my ($item,$price) = $line =~ /(\w.+)\s+([0-9.]+)/){
$price = $price * (0.75);
print $out "$item\t\t$price\n";
$items{$item} = $price;
$cnt++;
}
}
close($in);
close($out);
my $total = keys %items;
print "Total items - $total \n";
print "Total items - $cnt\n";
Using regex capture groups to capture the item and price (using \w.+ in case the item is 2 words like apple sauce), this will also prevent empty lines from printing to file.
I also hard coded the Item and Price header, probably a good idea if you are going to be using a consistent header.
Hope it helps
---Update ----
I added 2 examples of a total count in my script. The first one is using a hash and printing out the hash size, the second method is using a counter. The hash option is good except if your list has 2 items that are the same in which case the key of the hash will be overridden with the last item found which shares the same name. The counter is a simple solution.

Perl: averaging numbers in a file

I have a file where each line consists of a numerical value:
1
2
3
3
1
My function looks something like this:
print "Enter file name to average \n";
$infile = <>;
open IN, "$infile";
$total = 0;
$count =0;
while (my $line = <>) {
$total +=$line;
$count ++=;
}
print "Average = ", $total / $count, "\n";
close(IN);
But I'm getting an error at the $count ++=; line saying that there's a syntax error near "+=;".
Just do $count++, no =.
See http://perldoc.perl.org/perlop.html#Auto-increment-and-Auto-decrement.
No need of = after increment $count++. In simple way you can do like this:
#!/usr/bin/perl
use strict;
use warnings;
my $total = 0;
my $count = 0;
while (<>)
{
$total += $_;
$count++;
}
print "Average = ", $total/$count, "\n";
_____file.txt_____
1
2
3
3
1
Execute you program as:
./scriptname file.txt
Output:
Average = 2
The ideal code for your problem should look like below:
#!/usr/bin/perl
use strict;
use warnings;
print "Enter file name to average \n";
chomp( my $infile = <> ); #remove new line characters
open my $IN, '<', $infile
or die "unable to open file: $! \n"; #use 3 arg file open syntax
my $total = 0;
my $count = 0;
while ( my $line = <$IN> ) {## you should use <> on file handle
chomp $line;
next if ($line=~/^\s*$/); #ignore blank lines
$total += $line;
$count++;
}
if($count > 0){
print "Average = ", $total / $count, "\n";
}
else{
print "Average cannot be calculated Check if file is blank \n";
}
close($IN);

Check how many "," in each line in Perl [duplicate]

This question already has answers here:
Counting number of occurrences of a string inside another (Perl)
(4 answers)
Closed 7 years ago.
I have to check how many times was "," in each line in file. Anybody have idea how can I do it in Perl?
On this moment my code looks like it:
open($list, "<", $student_list)
while ($linelist = <$list>)
{
printf("$linelist");
}
close($list)
But I have no idea how to check how many times is "," in each $linelist :/
Use the transliteration operator in counting mode:
my $commas = $linelist =~ y/,//;
Edited in your code :
use warnings;
use strict;
open my $list, "<", "file.csv" or die $!;
while (my $linelist = <$list>)
{
my $commas = $linelist =~ y/,//;
print "$commas\n";
}
close($list);
If you just want to count the number of somethings in a file, you don't need to read it into memory. Since you aren't changing the file, mmap would be just fine:
use File::Map qw(map_file);
map_file my $map, $filename, '<';
my $count = $map =~ tr/,//;
#! perl
# perl script.pl [file path]
use strict;
use warnings;
my $file = shift or die "No file name provided";
open(my $IN, "<", $file) or die "Couldn't open file $file: $!";
my #matches = ();
my $index = 0;
# while <$IN> will get the file one line at a time rather than loading it all into memory
while(<$IN>){
my $line = $_;
my $current_count = 0;
# match globally, meaning keep track of where the last match was
$current_count++ while($line =~ m/,/g);
$matches[$index] = $current_count;
$index++;
}
$index = 0;
for(#matches){
$index++;
print "line $index had $_ matches\n"
}
You can use mmap Perl IO layer instead of File::Map. It is almost as efficient as former but most probably present in your Perl installation without needing installing a module. Next, using y/// is more efficient than m//g in array context.
use strict;
use warnings;
use autodie;
use constant STUDENT_LIST => 'text.txt';
open my $list, '<:mmap', STUDENT_LIST;
while ( my $line = <$list> ) {
my $count = $line =~ y/,//;
print "There is $count commas at $.. line.\n";
}
If you would like grammatically correct output you can use Lingua::EN::Inflect in the right place
use Lingua::EN::Inflect qw(inflect);
print inflect "There PL_V(is,$count) $count PL_N(comma,$count) at ORD($.) line.\n";
Example output:
There are 7 commas at 1st line.
There are 0 commas at 2nd line.
There is 1 comma at 3rd line.
There are 2 commas at 4th line.
There are 7 commas at 5th line.
Do you want #commas for each line in the file, or #commas in the entire file?
On a per-line basis, replace your while loop with:
my #data = <list>;
foreach my $line {
my #chars = split //, $line;
my $count = 0;
foreach my $c (#chars) { $count++ if $c eq "," }
print "There were $c commas\n";
}

Take random substrings from genome data

I am trying to use the substring function to take random 21 base sequences from a genome in fasta format. Below is the start of the sequence:
FILE1 data:
>gi|385195117|emb|HE681097.1| Staphylococcus aureus subsp. aureus HO 5096 0412 complete genome
CGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATAC
I have tried adapting a previous answer to use while reading my file and i'm not getting any error messages, just no output! The code hopefully prevents there being any overlap of sequences, though the chances of that are very small anyway.
Code as follows:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/randomoutput.txt";
open FILE1, "/Users/edwardtickle/Documents/EMRSA-15.fasta";
open( OUTPUTFILE, ">$outputfile" );
while ( my $line = <FILE1> ) {
if ( $line =~ /^([ATGCN]+)/ ) {
my $genome = $1;
my $size = 21;
my $count = 5;
my $mark = 'X';
if ( 2 * $size * $count - $size - $count >= length($genome) ) {
my #substrings;
while ( #substrings < $count ) {
my $pos = int rand( length($genome) - $size + 1 );
push #substrings, substr( $genome, $pos, $size, $mark x $size )
if substr( $genome, $pos, $size ) !~ /\Q$mark/;
for my $random (#substrings) {
print OUTPUTFILE "random\n";
}
}
}
}
}
Thanks for your help!
One of the neatest ways to select a random start point is to shuffle a list of all possible start points and select the first few -- as many as you need.
It's also best practice to use the three-parameter form of open, and lexical file handles.
The loop in this example starts much like your own -- picking up the genomes using a regex. The subsequences of length $size can start anywhere from zero up to $len_genome - $size, so the program generates a list of all these starting points, shuffles them using the utility function from List::Util, and puts them in #start_points.
Finally, if there are sufficient start points to form $count different subsequences, then they are printed, using substr in the print statement.
use strict;
use warnings;
use autodie;
use List::Util qw/ shuffle /;
my $outputfile = '/Users/edwardtickle/Documents/randomoutput.txt';
open my $in_fh, '<', '/Users/edwardtickle/Documents/EMRSA-15.fasta';
open my $out_fh, '>', $outputfile;
my $size = 21;
my $count = 5;
while (my $line = <$in_fh>) {
next unless $line =~ /^([ATGCN]+)/;
my $genome = $1;
my $len_genome = length $genome;
my #start_points = shuffle(0 .. $len_genome-$size);
next unless #start_points >= $count;
print substr($genome, $_, 21), "\n" for #start_points[0 .. $count-1];
}
output
TACACGATGCGAGCAATCAAA
GTTTACAATGTTTGAATACCT
ACATCACCATGAGTTTGGTCC
ATAACATCACCATGAGTTTGG
GGTCCGAAGCATGAGTGTTTA
I would recommend saving all possible positions for a substring in an array. That way you can remove possibilities after each substring to prevent overlap:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $infile = "/Users/edwardtickle/Documents/EMRSA-15.fasta";
my $outfile = "/Users/edwardtickle/Documents/randomoutput.txt";
my $size = 21;
my $count = 5;
my $min_length = ( $count - 1 ) * ( 2 * $size - 1 ) + $size;
#open my $infh, '<', $infile;
#open my $outfh, '>', $outfile;
my $infh = \*DATA;
my $outfh = \*STDOUT;
while ( my $line = <$infh> ) {
next unless $line =~ /^([ATGCN]+)/;
my $genome = $1;
# Need a long enough sequence for multiple substrings with no overlap
if ( $min_length > length $genome ) {
warn "Line $., Genome too small: Must be $min_length, not ", length($genome), "\n";
next;
}
# Save all possible positions for substrings in an array. This enables us
# to remove possibilities after each substring to prevent overlap.
my #pos = ( 0 .. length($genome) - 1 - ( $size - 1 ) );
for ( 1 .. $count ) {
my $index = int rand #pos;
my $pos = $pos[$index];
# Remove from possible positions
my $min = $index - ( $size - 1 );
$min = 0 if $min < 0;
splice #pos, $min, $size + $index - $min;
my $substring = substr $genome, $pos, $size;
print $outfh "$pos - $substring\n";
}
}
__DATA__
>gi|385195117|emb|HE681097.1| Staphylococcus aureus subsp. aureus HO 5096 0412 complete genome
CGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATACCGATTAAAGATAGAAATACACGATGCGAGCAATCAAA
CGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATACCGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATAC
Outputs:
Line 2, Genome too small: Must be 185, not 154
101 - CAGTTCTTATACATACCGATT
70 - ATGAGTGTTTACAATGTTTGA
6 - AAGATAGAAATACACGATGCG
38 - TTCATAACATCACCATGAGTT
182 - GAAGCATGAGTGTTTACAATG
Alternative method for large genomes
You mentioned in a comment that genome may be 2 gigs in size. If that's the case then it's possible that there won't be enough memory to have an array of all possible positions.
Your original approach of substituting a fake character for each chosen substring would work in that case. The following is how I would do it, using redo:
for ( 1 .. $count ) {
my $pos = int rand( length($genome) - ( $size - 1 ) );
my $str = substr $genome, $pos, $size;
redo if $str =~ /X/;
substr $genome, $pos, $size, 'X' x $size;
print $outfh "$pos - $str\n";
}
Also note, that if your genome really is that big, then you must also be wary of the randbits setting of your Perl version:
$ perl -V:randbits
randbits='48';
For some Windows versions, the randbits setting was just 15, therefore only returning 32,000 possible random values: Why does rand($val) not warn when $val > 2 ** randbits?
I found it more effective to move the output for loop outside the inner while, and to add a condition to the while such that $genome must contain a $size-long chunk that hasn't already been partly selected.
Just because you've got a string that's 117 characters long doesn't mean you'll find 5 random non-overlapping chunks.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/randomoutput.txt";
open FILE1, "/Users/edwardtickle/Documents/EMRSA-15.fasta";
open( OUTPUTFILE, ">$outputfile" );
while ( my $line = <FILE1> ) {
if ( $line =~ /^([ATGCN]+)/ ) {
my $genome = $1;
my $size = 21;
my $count = 5;
my $mark = 'X';
if ( 2 * $size * $count - $size - $count >= length($genome) ) {
my #substrings;
while ( #substrings < $count
and $genome =~ /[ATGCN]{$size}/ ) { # <- changed this
my $pos = int rand( length($genome) - $size + 1 );
push #substrings, substr( $genome, $pos, $size, $mark x $size )
if substr( $genome, $pos, $size ) !~ /\Q$mark/;
}
# v- changed this
print OUTPUTFILE "$_\n" for #substrings;
}
}
}

How can I find the elements appearing in two columns of a tab-delimited file?

I have a file which is tab delimited, and has two columns, A and B.
I want to count the number of times an element in B is repeated in A. I could have done it in Excel, but since the two columns contain more than 200k elements, it hangs.
I tried with this code but it counts elements in itself:
my %counts = ();
for (#A) {
$count{$_}++;
}
foreach my $k(keys %counts) {
print "$k\t$count{$k}\n";
}
Try this solution:
use strict;
use warnings;
my %countx;
my #y;
my $file = 'ab.txt';
open my $fh, '<', $file or die "Couldn't open $file";
while (my $line = <$fh>) {
chomp $line; # remove newline
# I've avoided using $a and $b because they are special variables in perl
my ( $x, $y ) = split /\t/, $line;
$countx{ $x }++;
push #y, $y;
}
close $fh;
foreach my $y (#y) {
my $count = $countx{ $y } || 0;
print "$y\t$count\n";
}