Perl: Grep unique value - perl

Basically, I want to emulate a piped grep operation, as we would do in a shell script (grep pattern1 | grep pattern2), in my Perl code to make the result unique.
The code below works, but I want to know whether this is the right approach. Please note that I don't want to introduce an inner loop here just for the grep part.
foreach my $LINE ( @ARRAY1 ) {
@LINES = split /\s+/, $LINE;
@RESULT= grep ( /$LINES[0]/, ( grep /$LINES[1]/, @ARRAY2 ) );
...

This is basically the same thing you're doing: for every @ARRAY2 element, check whether it matches ALL elements of @LINES (stopping as soon as any @LINES element does not match):
use List::Util "none";
my @RESULT= grep { my $s = $_; none { $s !~ /$_/ } @LINES } @ARRAY2;
# index() is faster for literal values
my @RESULT= grep { my $s = $_; none { index($s, $_) <0 } @LINES } @ARRAY2;

There is no need to cascade calls to grep; you can simply "and" the conditions together.
It's also worth saying that you should use lower-case letters for your identifiers, and that split /\s+/ should almost always be split ' ' (which also discards leading whitespace instead of producing an empty first field).
Here's what I would write
for my $line ( @array1 ) {
my @fields = split ' ', $line;
my @result = grep { /$fields[0]/ and /$fields[1]/ } @array2;
...
}
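For an arbitrary number of patterns per line, the same idea can be wrapped in a small sub. This is only a sketch: grep_all is a made-up name, the sample data is invented, and \Q...\E is my assumption that the fields should be matched as literal text rather than as regexes.

```perl
use strict;
use warnings;
use List::Util qw(all);

# Hypothetical helper: return the elements of @$candidates that contain
# every whitespace-separated field found in $line.
sub grep_all {
    my ($line, $candidates) = @_;
    my @patterns = split ' ', $line;
    return grep {
        my $s = $_;
        # \Q...\E quotes regex metacharacters, so each field is matched literally
        all { $s =~ /\Q$_\E/ } @patterns;
    } @$candidates;
}

# Invented sample data
my @array2 = ('alpha beta', 'alpha gamma', 'beta gamma');
print "$_\n" for grep_all('alpha gamma', \@array2);   # prints "alpha gamma"
```

Drop the \Q...\E if the fields really are meant to be regular expressions.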

There are different ways to grep/extract unique values from an array in Perl.
use Data::Dumper;
##2) Best of all (note: keys returns the values in no particular order)
my %hash = map { $_ , 1 } @array;
my @uniq = keys %hash;
print "\n Uniq Array:", Dumper(\@uniq);
##3) Costly process as it involves 'greping'
my %saw;
my @out = grep(!$saw{$_}++, @array);
print "\n Uniq Array: @out \n";
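If the original order matters, the grep variant is the one to use, since a hash does not remember insertion order. A minimal sketch, with uniq_in_order as a made-up helper name and invented sample values:

```perl
use strict;
use warnings;

# Hypothetical helper: drop repeated values, keeping the first occurrence
# of each so that the original order is preserved.
sub uniq_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @array = qw(b a b c a);
my @uniq  = uniq_in_order(@array);
print "@uniq\n";    # prints "b a c"
```

Reasonably recent versions of List::Util also ship a uniq function that does the same job, if it is available on your installation.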

Related

How to separate an array in Perl based on pattern

I am trying to write a big script, but I am stuck on one part. I want to split an array based on ".."
From the script I got this:
print @coordinates;
gene complement(872..1288)
my desired output:
complement 872 1288
I tried:
1) my @answer = split(.., @coordinates)
print("@answer\n");
2) my @answer = split /../, @coordinates;
3) print +(split /\../)[-1],[-2],[-3] while <@coordinates>
4) foreach my $anwser ( @coordinates )
{$anwser =~ s/../"\t"/;
print $anwser;}
5) my @answer = split(/../, "complement(872..1288)"); #to see if the printed array is problematic.
which prints:
) ) ) ) ) ) ) ) )
6) my @answer = split /"gene "/, @coordinates; # I tried to "catch" the entire output's spaces and tabs
which prints
0000000000000000000000000000000001000000000100000000
But none of them work. Does anyone have any idea how to get around this issue?
P.S. Unfortunately, I can't run my script on Linux right now, so I used this website to run it. I hope this is not the reason why I didn't get my desired output.
my $RE_COMPLEMENT = qr{(complement)\((\d+)\.\.(\d+)\)}msx;
for my $item (@coordinates) {
my ($head, $i, $j) = $item =~ $RE_COMPLEMENT;
if (defined($head) && defined($i) && defined($j)) {
print("$head\t$i\t$j\n");
}
}
split operates on a scalar, not on an array.
my $string = 'gene complement(872..1288)';
my @parts = split /\.\./, $string;
print $parts[0]; # gene complement(872
print $parts[1]; # 1288)
To get the desired output, you can use a substitution:
my $string = 'gene complement(872..1288)';
$string =~ s/gene +|\)//g;
$string =~ s/\.\./ /;
$string =~ s/\(/ /;
The desired effect can be achieved by:
using the tr operator to replace each of '(', '.' and ')' with a space,
then splitting the data string into elements on whitespace,
keeping only the required part of the resulting list,
and printing those elements joined with tabs.
use strict;
use warnings;
use feature 'say';
my $data = <DATA>;
chomp $data;
$data =~ tr/(.)/ /;
my @elements = (split ' ', $data)[1..3];
say join "\t", @elements;
__DATA__
gene complement(872..1288)
Or as an alternative solution with only substitutions (without splitting data string into array)
use strict;
use warnings;
use feature 'say';
my $data = <DATA>;
chomp $data;
$data =~ s/gene\s+//;
$data =~ s/\)//;
$data =~ s/[(.]+/\t/g;
say $data;
__DATA__
gene complement(872..1288)
Output
complement 872 1288

Ignore the first two lines with ## in perl

Hi all.
I'm a newbie in programming, especially in Perl. I would like to skip the first two lines in my dataset.
This is my code:
while (<PEPTIDELIST>) {
next if $_ !=~ "##";
chomp $_;
@data = split /\t/;
chomp $_;
next if /Sequence/;
chomp $_;
$npeptides++;
# print "debug: 0: $data[0] 1: $data[1] 2: $data[2] 3: $data[3] \n" if ( $debug );
my $pepseq = $data[1];
#print $pepseq."\n";
foreach my $header (keys %sequence) {
#print "looking for $pepseq in $header \n";
if ($sequence{$header} =~ /$pepseq/ ) {
print "matched $pepseq in protein $header" if ( $debug );
# my $in =<STDIN>;
if ( $header =~ /(ENSGALP\S+)\s.+(ENSGALG\S+)/ ) {
print "debug: $1 $2 have the pep = $pepseq \n\n" if (
$debug);
my $lprot = $1;
my $lgene = $2;
$gccount{$lgene}++;
$pccount{$lprot}++;
# print "$1" if($debug);
# print "$2" if ($debug);
print OUT "$pepseq,$1,$2\n";
}
}
}
my $ngenes = keys %gccount;
my $nprots = keys %pccount;
Somehow the peptide is not in the output list. Could you please point out where it goes wrong?
Thanks
If you want to skip lines that contain ## anywhere in them:
next if /##/;
If you only want to skip lines that start with ##:
next if /^##/;
If you always want to skip the first two lines, regardless of content:
next if $. < 3;
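next if $_ !=~ "##"; must be next if $_ =~ "##";
There is no !=~ operator: Perl parses it as != followed by ~, that is, a numeric comparison against a bitwise-negated string rather than a pattern match. You want to ignore the line if $_ matched ##.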
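Putting the skipping logic together, something along these lines should work. It is a sketch only: data_rows is a made-up helper, the four sample lines are invented, and I am assuming that lines starting with ## and the row containing "Sequence" are the only ones to discard.

```perl
use strict;
use warnings;

# Hypothetical helper: return the tab-separated fields of every line that
# is neither a "##" comment line nor the "Sequence" header row.
sub data_rows {
    my @rows;
    for my $line (@_) {
        next if $line =~ /^##/;        # skip lines that start with ##
        next if $line =~ /Sequence/;   # skip the column header
        chomp(my $copy = $line);       # chomp a copy, leave the input alone
        push @rows, [ split /\t/, $copy ];
    }
    return @rows;
}

# Invented sample input
my @lines = ("## comment one\n", "## comment two\n",
             "Id\tSequence\n", "1\tPEPTIDE\n");
print join(',', @$_), "\n" for data_rows(@lines);   # prints "1,PEPTIDE"
```

Note that chomp is only needed once per line, before the split.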

Perl Regular expression extract

I'm trying to extract a certain string of numbers from a text file using a regular expression, but when my code runs, it also grabs the numbers after the slash that separates the date and time. Here is what I have so far.
while ( <INFILE> ) {
my @fields = split( /\ /, $_ );
my @output;
foreach my $field ( @fields ) {
if ( $field =~ /[0-9]{5}\// ) {
push @output, $field;
}
}
if ( @output ) {
my $line = join( ' ', @output );
print "$line\n";
print OUTFILE "$line\n";
}
}
The line I am trying to extract data from is
D2001235 9204 254/2004 254/1944 254/2041 15254/2011 ALL-V4YM 001 AUTO C-C0000
The data I need is the 15254 but when I run my code it returns 15254/2011 and my program is erroring out.
The problem is that you are storing the entire $field in the output array, but you only want the number to the left of the slash to be stored. You could use capturing parentheses in the regular expression and the $1 special variable. This outputs 15254:
use warnings;
use strict;
while (<DATA>) {
my @fields = split( /\ /, $_ );
my @output;
foreach my $field (@fields) {
if ( $field =~ /^([0-9]{5})\// ) {
push @output, $1;
}
}
if (@output) {
my $line = join( ' ', @output );
print "$line\n";
}
}
__DATA__
D2001235 9204 254/2004 254/1944 254/2041 15254/2011 ALL-V4YM 001 AUTO C-C0000
As explained, you are saving the entire field in @output if it matches the regex, instead of just the part before the slash.
Your split is also unnecessarily complicated, and join isn't needed.
All you need is this:
while ( <INFILE> ) {
my @output = map m{^([0-9]{5})/}, split;
if ( @output ) {
print "@output\n";
print OUTFILE "@output\n";
}
}
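The map version works because a match in list context returns its captures on success and an empty list on failure, so non-matching fields simply vanish. Here is a self-contained sketch of that, using the sample line from the question; five_digit_prefixes is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the five-digit numbers that sit at the start
# of a whitespace-separated field and are immediately followed by a slash.
sub five_digit_prefixes {
    my ($line) = @_;
    # In list context a successful match returns its captures and a failed
    # match returns an empty list, so map keeps only the wanted numbers.
    return map m{^([0-9]{5})/}, split ' ', $line;
}

my $line = 'D2001235 9204 254/2004 254/1944 254/2041 15254/2011 ALL-V4YM 001 AUTO C-C0000';
print join(' ', five_digit_prefixes($line)), "\n";    # prints "15254"
```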

Perl - sort filenames by order based on the filemask YYYY-MM-DD that is in the filename

I need some help; I'm not sure which method I should use here.
I need to scan a directory and obtain the filenames ordered by
1. YYYY-MM-DD, which is part of the filename.
2. The machine name, which is at the start of the filename, to the left of the first "."
For example
Machine1.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine2.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine3.output.log.2014-02-26
So that it outputs in an array as follows
Machine1.output.log.2014-02-26
Machine2.output.log.2014-02-26
Machine3.output.log.2014-02-26
Machine1.output.log.2014-02-27
Machine2.output.log.2014-02-27
Thanks,
Often, temporarily turning your strings into a hash or array for sorting purposes, and then turning them back into the original strings is the most maintainable way.
my @filenames = qw/
Machine1.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine2.output.log.2014-02-26
Machine2.output.log.2014-02-27
Machine3.output.log.2014-02-26
/;
@filenames =
map $_->{'orig_string'},
sort {
$a->{'date'} cmp $b->{'date'} || $a->{'machine_name'} cmp $b->{'machine_name'}
}
map {
my %attributes;
@attributes{ qw/orig_string machine_name date/ } = /\A(([^.]+)\..*\.([^.]+))\z/;
%attributes ? \%attributes : ()
} @filenames;
You can define your own sort like so ...
my @files = (
"Abc1.xxx.log.2014-02-26"
, "Abc1.xxx.log.2014-02-27"
, "Abc2.xxx.log.2014-02-26"
, "Abc2.xxx.log.2014-02-27"
, "Abc3.xxx.log.2014-02-26"
);
foreach my $i ( @files ) { print "$i\n"; }
sub bydate {
(split /\./, $a)[3] cmp (split /\./, $b)[3];
}
print "sort it\n";
foreach my $i ( sort bydate @files ) { print "$i\n"; }
You can take your pattern 'YYYY-MM-DD' and match it to what you need.
#!/usr/bin/perl
use strict;
opendir (DIRFILES, ".") || die "can not open directory \n";
my @maplist = readdir(DIRFILES);
closedir(DIRFILES);
my %somehash;
foreach my $tmp (@maplist) {
next if $tmp =~ /^\.{1,2}$/; # skip . and ..
next if $tmp =~ /test/;
next unless $tmp =~ /(\d{4})-(\d{2})-(\d{2})/; # skip names without a date
$somehash{$tmp} = $1 . $2 . $3; # keep the original file name
# allows for duplicate dates
}
foreach my $tmp (keys %somehash) {
print "-->", $tmp, " --> " , $somehash{$tmp},"\n";
}
my @list= sort { $somehash{$a} <=> $somehash{$b} } keys(%somehash);
foreach my $tmp (@list) {
print $tmp, "\n";
}
Works, tested it with touch files.
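None of the date-only sorts above break ties on the machine name, which the question also asks for. A compact sketch that does both, using array references instead of hashes, might look like this. The sub name is made up, the filenames are sample values, and I am assuming every name of interest ends in .YYYY-MM-DD; names that do not match are silently dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: sort names shaped like "Machine.anything.YYYY-MM-DD"
# by date first and machine name second.
sub sort_by_date_then_machine {
    return map  { $_->[0] }                                   # back to the name
           sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] } # date, then machine
           map  { /\A([^.]+)\..*\.(\d{4}-\d{2}-\d{2})\z/ ? [ $_, $2, $1 ] : () }
           @_;
}

# Sample filenames for illustration
my @files = qw(
    Machine2.output.log.2014-02-27
    Machine1.output.log.2014-02-26
    Machine3.output.log.2014-02-26
    Machine2.output.log.2014-02-26
);
print "$_\n" for sort_by_date_then_machine(@files);
```

ISO-style dates sort correctly as plain strings, which is why cmp is enough. Machine names are also compared as strings here, so Machine10 would sort before Machine2; a numeric comparison on the trailing digits would be needed if that matters.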

How can I iterate through nested arrays?

I have created an array as follows
while (defined ($line = <STDIN>))
{
chomp ($line);
push @stack,($line);
}
Each line has two numbers.
15 6
2 8
How do I iterate over each item in each line?
i.e. I want to print
15
6
2
8
I understand it's something like
foreach (@{stack}) (@stack){
print "?????
}
This is where I am stuck.
See the perldsc documentation. That's the Perl Data Structures Cookbook, which has examples for dealing with arrays of arrays. From what you're doing though, it doesn't look like you need an array of arrays.
For your problem of taking two numbers per line and outputting one number per line, just turn the whitespace into newlines:
while( <> ) {
s/\s+/\n/g; # turn all whitespace runs into newlines
print; # it's ready to print
}
With Perl 5.10, you can use the new \h character class that matches only horizontal whitespace:
while( <> ) {
s/\h+/\n/g; # turn all horizontal whitespace runs into newlines
print; # it's ready to print
}
As a Perl one-liner, that's just:
% perl -pe 's/\h+/\n/g' file.txt
#!/usr/bin/perl
use strict;
use warnings;
while ( my $data = <DATA> ) {
my @values = split ' ', $data;
print $_, "\n" for @values;
}
__DATA__
15 6
2 8
Output:
C:\Temp> h
15
6
2
8
Alternatively, if you want to store each line in @stack and print it out later:
my @stack = map { [ split ] } grep { chomp; length } <DATA>;
The line above slurps everything coming from the DATA filehandle into a list of lines (because <DATA> happens in list context). The grep chomps each line and filters by length after chomping (to avoid getting any trailing empty lines in the data file; you can skip it if there are none). The map then splits each line on whitespace and creates an anonymous array reference for each line. Finally, those array references are stored as the elements of @stack. You might want to use Data::Dumper to look at @stack to understand what's going on.
print join("\n", @$_), "\n" for @stack;
Now we loop over each entry in @stack, dereferencing each array in turn, then joining the elements of each array with newlines to print one element per line.
Output:
C:\Temp> h
15
6
2
8
The long way of writing essentially the same thing (with less memory consumption) would be:
my @stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my @values = split ' ', $line;
push @stack, \@values;
}
for my $ref ( @stack ) {
print join("\n", @$ref), "\n";
}
Finally, if you wanted to do something other than printing all the values, say, sum all the numbers, you should store one value per element of @stack:
use List::Util qw( sum );
my @stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my @values = split ' ', $line;
push @stack, @values;
}
printf "The sum is %d\n", sum @stack;
#!/usr/bin/perl
while ($line = <STDIN>) {
chomp ($line);
push @stack, $line;
}
# prints each line
foreach $line (@stack) {
print "$line\n";
}
# splits each line into items using ' ' as separator
# and prints the items
foreach $line (@stack) {
@items = split / /, $line;
foreach $item (@items) {
print $item . "\n";
}
}
I use 'for' for "C" style loops, and 'foreach' for iterating over lists.
#!/usr/bin/perl
use strict;
use warnings;
open IN, "< read.txt" or
die "Can't read in 'read.txt'!";
my $content = join '', <IN>;
while ($content =~ m/(\d+)/g) {
print "$1\n";
}