how to remove last single line available in file using perl - perl

how to remove last single line available in file using perl.
I have my data like below.
"A",1,-2,-1,-4,
"B",3,-5,-2.-5,
how to remove the last line... I am summing all the numbers but receiving a null value at the end.
Tried using chomp but did not work.
Here is the code currently being used:
while (<data>) {
chomp(my #row = (split ',' , $_ , -1);
say sum #row[1 .. $#row];
}

Try this (shell one-liner) :
perl -lne '!eof() and print' file
or as part of a script :
while (defined($_ = readline ARGV)) {
print $_ unless eof();
}

You should be using Text::CSV or Text::CSV_XS for handling comma separated value files. Those modules are available on CPAN. That type of solution would look like this:
use Text::CSV;
use List::Util qw(sum);
my $csv = Text::CSV->new({binary => 1})
or die "Cannot use CSV: " . Text::CSV->error_diag;
while(my $row = $csv->getline($fh)) {
next unless ($row->[0] || '') =~ m/\w/; # Reject rows that don't start with an identifier.
my $sum = sum(#$row[1..$#$row]);
print "$sum\n";
}
If you are stuck with a solution that doesn't use a proper CSV parser, then at least you'll need to add this to your existing while loop, immediately after your chomp:
next unless scalar(#row) && length $row[0]; # Skip empty rows.
The point to this line is to detect when a row is empty -- has no elements, or elements were empty after the chomp.

I suspect this is an X/Y question. You think you want to avoid processing the final (empty?) line in your input when actually you should be ensuring that all of your input data is in the format you expect.
There are a number of things you can do to check the validity of your data.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'sum';
use Scalar::Util 'looks_like_number';
while (<DATA>) {
# Chomp the input before splitting it.
chomp;
# Remove the -1 from your call to split().
# This automatically removes any empty trailing fields.
my #row = split /,/;
# Skip lines that are empty.
# 1/ Ensure there is data in #row.
# 2/ Ensure at least one element in #row contains
# non-whitespace data.
next unless #row and grep { /\S/ } #row;
# Ensure that all of the data you pass to sum()
# looks like numbers.
say sum grep { looks_like_number $_ } #row[1 .. $#row];
}
__DATA__
"A",1.2,-1.5,4.2,1.4,
"B",2.6,-.50,-1.6,0.3,-1.3,

Related

Perl grep through large file to match a string

I have an array (#array) which has list of elements. I need to check whether these each of the elements are exists in master file or not. If the element exists in master file then in the same line of master file the string YES (in 5th position) should also exists. And the element should be stored in different array.
Actually my script uses two grep shell command to achieve this. How can I write same thing in Perl do grep.
...
use Data::Dumper;
my #new_array;
my #array = ('RT0AC1', 'WG3RA3');
print Dumper(\#array);
foreach ( #array ){
my $line = `grep $_ "master_file.csv" | grep -i yes`;
next unless($line);
push( #new_array, $_ );
}
print Dumper(#new_array);
...
where master_file.csv looks like this:
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
So Here I am getting $line value as 102,RT0AC1,CONNECTED,WORKING,YES and element RT0AC1 is getting stored in #new_array.
How can I avoid using backtick(`) and two greps to achieve this. I am trying to do this using pure Perl. Also the master_file.csv contains millions of records.
Since all the words you're looking for are in the same location, it's easy to just split up the current line on commas and see if the second column exists in a hash table, and if the fifth column is equal to "YES":
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
my $filename = shift // "master_file.csv"; # Default filename if not given on command line
my #array = qw/RT0AC1 WG3RA3/; # Words you're looking for
my %words = map { $_ => 1 } #array; # Store them in a hash for fast lookup
my #new_array;
# Use Text::CSV_XS for non-trivial CSV files
open my $csv, "<", $filename;
while (<$csv>) {
chomp;
my #F = split /,/;
push #new_array, $F[1] if exists $words{$F[1]} && $F[4] eq "YES";
}
print Dumper(\#new_array);
Form regex to match records of interest, split line into fields and compare field #5 to YES. If there is a match increase a count for field #2 in %match hash.
Once the file processed %match hash will have matched records field #2 as a key and value will reflect how many times this field was matched with YES in the file.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my #look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',#look_for);
while(<DATA>) {
chomp;
next unless /$re_filter/;
my #data = split(',',$_);
$match{$data[1]}++ if $data[4] eq 'YES';
}
say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
Output
$VAR1 = {
'RT0AC1' => 1
};
Remove DATA to get final code and give filename on command line to process file with data of interest
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my #look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',#look_for);
while(<>) {
chomp;
next unless /$re_filter/;
my #data = split(',',$_);
$match{$data[1]}++ if $data[4] eq 'YES';
}
say Dumper(\%match);
An alternative version based on regular expression without using split
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my #look_for = qw(RT0AC1 WG3RA3);
my $re_filter = join('|',#look_for);
my $regex = qr/^\d+,($re_filter),[^,]+,[^,]+,YES$/;
/$regex/ && $match{$1}++ for <DATA>;
say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO

How to remove array's newlines and add an element at the beginning of it in Perl?

First of I have to apologize for editing my initial post. But after I provide my code I did the question fuzzy.
So, I have this an array (#start_cod) containing lines separated by /n as follows:
print #start_cod;
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
I need to remove the newlines and add ">text" ONLY at the beginning of the array as follow:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
I tried:
s/\s+\z// for #start_cod;
print ">text#start_cod";
I tried also with chomp
chomp #start_cod;
print ">text#start_cod";
and
my #start_cod = split("\n",$start_cod);
$start_cod = join("",#start_cod);
print ">text$start_cod";
but I get
aaaaaaaaaaaaaaaaaaa>textcacacacacaacaccacaac>textaattatatattataattatatttat
Any suggestions on how to handle this in Perl Programming?
Here is my code which works 100%.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my %alliloux =();
$/="\n>";
while (<>) {
s/>//g;
my ($onoma, #seq) = split (/\n/, $_);
my ($sp, $head) = split (/\./, $onoma);
push #{ $alliloux{$sp} }, join "\n", ">$onoma", #seq;
}
foreach my $sp (keys %alliloux) {
chomp $sp;
my ($head, $dna) = split(/\t/, $sp);
my #start_cod = substr($dna, 3);
say #start_cod;
Input file:
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
>namex aattatatattataattatatttat
output after Perl run
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
Desired output:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
If I understand your question correctly, this should do what you want:
use strict;
use warnings;
my #start_cod = (
'aaaaaaaaaaaaaaaaaa',
'acacacacacaacaccacaac',
'aattatatattataattatatttat',
);
print ">text\n", #start_cod, "\n";
The print first prints ">text" and a newline once, then you get the #start_cod items on a line, and the last "\n" makes sure you have a newline after the last element.
Output:
>text
aaaaaaaaaaaaaaaaaaacacacacacaacaccacaacaattatatattataattatatttat
You might want to see Read FASTA into Hash. It's the same problem and very close to the code I wrote before I read it. Also, there are modules on CPAN that can handle FASTA.
I think you want to combine the sequences that start with the same name, disregarding the numbers. The sequences shouldn't have interior whitespace. In your code, you are constantly adding whitespace. You even join on a newline. So, you go to the doctor and say "My arm hurts when I do this", and the doctor says "So don't do that". :)
When you run into these sort of problems, check the results of your operations at each step to see if you get what you expect. Here's a much simplified version of a program that I think does what you want. I've removed most of the data structure because they are complicating your process.
In short, read a line and remove the newline at the end. That's one source of your newlines. Then, extract the sequence and concatenate that to the previous sequence. When you join with newlines, you are adding newlines. So, don't do that:
use v5.14;
use warnings;
use Data::Dumper;
my %alliloux = ();
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
# now split on whitespace, but only up to two parts.
# There's no array here.
my( $name, $seq ) = split /\s+/, $_, 2;
# remove the numbers at the end to get the prefix of the
# name.
my $prefix = $name =~ s/\d+\z//r;
# append the current sequence for this prefix to what we
# have already seen.f
$alliloux{$prefix} .= $seq;
}
say Dumper( \%alliloux );
foreach my $base ( keys %alliloux ) {
say ">text $alliloux{$base}";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
You don't need the intermediate array. You can build up your string as you go. You don't need to have all the parts before you do that.
Now, to figure out where you might be going wrong, do a little at once. Ensure that you've extracted the right thing. It's handle to put characters around the variables you interpolate so you can see whitespace at the beginning or end:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
}
Then, add another step, and ensure that works:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
my $prefix = $name =~ s/\d+\z//r;
say "Prefix: <$prefix>";
}
Repeat this process for each step. Then, when you come with a question, you've pinpointed the point where things diverge. Here's the same technique in your program:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
s/>//g;
my ($onoma, #seq) = split (/\n/, $_);
say "Onoma: <$onoma>";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
The output shows that you never had anything in #seq. You are splitting on a newline, but unless you've changed the default line ending, you'll only get a newline at the end:
Onoma: <name aaa>
Onoma: <name2 cccc>
Onoma: <name99 aattaatt>
Now there's nothing in #seq, so a line like join "\n", ">$onoma", #seq; is really just join "\n", ">$onoma". You could have seen that with a little checking.
The description lacks clarity of the problem.
By looking at the desired output the following code comes to mind. Please see if it does what you was looking for.
Even looking at your code it is not clear what you try to do -- some part of the code does not make much sense.
use strict;
use warnings;
use feature 'say';
my #start_cod;
while( <DATA> ) {
chomp;
next unless />\s?name.?\s+(.*)/;
push #start_cod, $1;
}
print ">text\n " . join('',#start_cod);
__DATA__
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
.
.
.
> namex aattatatattataattatatttat

How can I get the last element of each line in perl?

I have to get the last element in each line. I am using perl..
1.25.56.524.2
2.56.25.254.3
2.54.28.264.2
Just split each line on a dot, last element has the index -1:
print +(split /\./)[-1] while <>;
One possibility would be:
use strict;
use warnings;
my #Result; # Array holding the results
while (<DATA>) # whatever you use to provide the lines...
{
chomp; # good practice, usually necessary for STDIN
my #Tokens = split(/\./); # assuming that "." is the separator
push( #Result , $Tokens[-1] );
}
__DATA__
1.25.56.524.2
2.56.25.254.3
2.54.28.264.2
I assume by last element you mean the last value separated by .. Have a look at this code:
use strict;
my #last_values; # UPDATE: initialize array
for my $line (<DATA>) { # read line by line
chomp $line; # remove newline at the end
my #fields = split '\.', $line; # split to fields
my $last = pop #fields; # get last field
print $last."\n";
push #last_values, $last; # UPDATE: store last field in array
}
__DATA__
1.25.56.524.2
2.56.25.254.3
2.54.28.264.2
Output:
2
3
2

Perl: Manipulating while(<>) loop in file reading

My question is regarding the while loop that reads line from files. The situation is that I want to store values from or the entire next line when the while loop while(<FILEHANDLE>) is performing the action on present line ($_). So what is the way to address this problem? Is there a specific function or module that does this thing?
If you want to process four lines at a time and each set of lines is separated by #FCC then you need to change perl's input file separator.
In your script put
$/="\#FCC"
This means that when you do (<>), each record you get in $_ is now four lines of your file.
use warnings;
use strict;
local $/="\#FCC";
while (<>) {
chomp;
#Each time we iterate, $_ is now all four lines of each record.
}
Edit
You'll need to backslash the #
You can read from <> anywhere, not just in the head of the loop, e.g.
while (my $line = <>) {
chomp $line;
my $another_line = <>;
chomp $another_line;
print "$line followed by $another_line\n";
}
Assuming your file is small-ish (perhaps less than 1gb) you could just stuff it into an array and walk it:
use warnings;
use strict;
my #lines;
while (<>) {
chomp;
push #lines, $_;
}
my $num_lines = #lines; #use of array in scalar context give length of array
# don't do last line (there is no next one)
$num_lines -= 1;
foreach (my $i = 0; $i < $num_lines; $i++) {
my $next_line = $i+1;
print "line $i plus $next_line:",$lines[$i],$lines[$i+1],"\n";
}
Note that the semantics of my solution is a bit different from the answer above. My solution would print out everything except the first line twice; if you wanted everything to be printed once, the above solution might make more sense.
If you want to read n lines at a time from a file you can use Tie::File and use an array to reference n elements at a time, like this:
use strict;
use warnings;
use Tie::File;
my $filename = 'path_to_your_file';
tie my #array, 'Tie::File', $filename or die 'Unable to open file';
my $index = 0;
my $size = #array;
while (1) {
last if ($index > $size); # Be careful! Try to do a better check than this!
print #array[$index..$index+3];
print "----\n";
$index += 4;
}
(This is just an example, try to write better code)
As the documentation says, the file is not loaded into memory all at once, so it will work even for large files.

Perl- Extract each line from a txt file and store into different variables

I readin a txt file using a perl script, but im wondering how to store each line from the txt file into a different variable in the perl script using pattern matching. I can match a line using ~^>gi , but it displays both lines from the txt file with >gi (i.e line 1 & 3), also i want to read the two separate DNA sequences into different variables. Consider my example below.
file.txt
>gi102939
GATCTATC
>gi123453
CATCGACA
the perl script:
#!/usr/local/bin/perl
open (MYFILE, 'file.txt');
#array = <MYFILE>;
($first, $second, $third, $fourth, $fifth) = #array;
chomp $first, $second, $third, $fourth, $fifth;
print "Contents:\n #array";
if (#array =~ /^>gi/)
{
print "$first";
}
close (MYFILE);
Assuming that >gi.. are unique in the input, populate a hash where each key is associated with a sequence:
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
my $last;
while (<DATA>) {
chomp;
if (/^>gi/) {
$last = $_;
} else {
$hash{$last} = $_;
}
}
foreach my $k (keys %hash) {
print "$k => $hash{$k}\n";
}
__DATA__
>gi102939
GATCTATC
>gi123453
CATCGACA
Please always use strict and use warnings at the top of your program, and declare your variables using my at their first point of use. This applies epecially when you are asking for help, as doing so can frequently reveal simlpe problems that could otherwise be overlooked.
As it stands, your program will read the file into #array and print it out. The test if (#array =~ /^>gi/) { ... } will force scalar context on the array, and so compare the number of elements in the array, presumably 5, with the regex pattern and fail.
What exactly are you trying to achieve? Reading a file into an array puts each line into a different scalar variables - the variables being the elements of the array
This one-liner reads the database and extracts one element:
perl < file.txt -e '#array=<>;chomp #array;%hash=#array;print $hash{">gi102939"}'
result:
GATCTATC