Splitting an array into n accessible parts in Perl?

My goal is to take an array of letters and cut it into n parts, in this case of no more than 10 letters each. I want these parts stored as array references in an array, so that I can access each one by index.
For example, the following script takes the first 10 letters from an array of the English alphabet. Since the alphabet has 26 letters, I need two more arrays to hold the rest, all accessible through references.
#!/usr/bin/env perl
#split an array into parts.
use strict;
use warnings;
use feature 'say';
my @letters = ('A' .. 'Z');
say "These are my letters:";
for(@letters){print "$_ ";}
my @letters_selected = splice(@letters, 0, 10);
say "\nThese are my selected letters:";
for(@letters_selected){print "$_ ";}
The output is this:
These are my letters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
These are my selected letters:
A B C D E F G H I J
This little script only gives me one piece of 10 letters of the alphabet. But I want all three pieces of 10 letters of the alphabet, so I would like to know how I can achieve this:
Goal:
Have an array called letters_selected that holds array references, together covering all the letters A - Z, so that I can access each of the three pieces (of at most 10 letters) like this:
foreach(@{$letters_selected[0]}){say "$_ ";}
returns: A B C D E F G H I J # These are the initial 10 elements of the alphabet.
foreach(@{$letters_selected[1]}){say "$_ ";}
returns: K L M N O P Q R S T # The next 10 after that.
foreach(@{$letters_selected[2]}){say "$_ ";}
returns: U V W X Y Z # The next no more than 10 after that.

Since splice is destructive to its target, you can keep applying it until the array is empty:
use warnings;
use strict;
use feature 'say';
my @letters = 'A'..'Z';
my @letter_groups;
push @letter_groups, [ splice @letters, 0, 10 ] while @letters;
say "@$_" for @letter_groups;
After this, @letters is empty, so make a copy first and work with that if you still need the original.
Every time through the loop, splice removes and returns up to 10 elements from @letters, and [ ] makes an anonymous array from that list. A reference to it is pushed onto @letter_groups.
Since splice takes only as many elements as there are, once fewer than 10 remain it removes and returns just those; @letters is then empty and the while loop terminates.
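The copy-then-splice approach described above can be wrapped in a small helper. This is only a sketch: the name chunk and its argument order are made up for illustration. Because the list is passed by value, the helper consumes its own copy and the caller's array survives.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: split a list into groups of at most $size elements.
# It works on a private copy, so the caller's array is left untouched.
sub chunk {
    my ($size, @items) = @_;
    die "chunk size must be positive\n" unless $size > 0;
    my @groups;
    push @groups, [ splice @items, 0, $size ] while @items;
    return @groups;
}

my @letters       = ('A' .. 'Z');
my @letter_groups = chunk(10, @letters);

say scalar(@letters), " letters kept";    # 26 letters kept
say "group $_: @{ $letter_groups[$_] }" for 0 .. $#letter_groups;
```

Each element of @letter_groups is an array reference, so @{ $letter_groups[2] } gives the last, shorter group (U through Z).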

Related

Perl : $_ and $array[$_] issue

Below is my code , please review.
use strict;
my @people = qw{a b c d e f};
foreach (@people){
print $_,"$people[$_]\n";
}
Below is the output,
[~/perl]$ perl test.pl
aa #why the output of $people[$_] is not same with $_?
ba
ca
da
ea
fa
Thanks for your help.
$_ is the actual element you're looking at. $people[$_] is getting the $_th element out of @people. It's intended to be used with numerical indices, so it numifies its argument. Taking a letter and "converting" it to a number just converts to zero since it can't parse a number out of it. So $people[$_] is just $people[0] when $_ is not a number. If you try the same experiment where some of the list elements are actually numerical, you'll get some more interesting results.
Try:
use strict;
my @people = qw{a b c 3 2 1};
foreach (@people){
print $_,"$people[$_]\n";
}
Output:
aa
ba
ca
33
2c
1b
In the first three cases, a, b, and c can't be parsed as numbers, so we get the zeroth element. The strings 3, 2, and 1 do convert to numbers, so we get elements 3, 2, and then 1.
EDIT: As mentioned in a comment by @ikegami, put use warnings at the top of your file in addition to use strict to get warned about this sort of thing.
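If both the position and the value are wanted, one option is to loop over the indices instead of the elements. The sketch below also includes a made-up helper, as_index, that only illustrates the string-to-number conversion described above; it is not something Perl provides.

```perl
use strict;
use warnings;

my @people = qw(a b c 3 2 1);

# Iterate over indices when both the position and the value are needed.
for my $i (0 .. $#people) {
    print "$i: $people[$i]\n";
}

# Hypothetical helper: show what a string turns into when used as a number.
sub as_index {
    my ($str) = @_;
    no warnings 'numeric';    # silence the non-numeric warning for this demo
    return int($str);
}

print as_index('a'), "\n";    # 0
print as_index('3'), "\n";    # 3
```

Without the no warnings 'numeric' line, the conversion of 'a' would produce a warning, which is the hint that use warnings gives for the original code.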

sort uniq(#stuff) doesn't sort, doesn't de-dupe [duplicate]

Some code has me waxing lyrical,
The outcome is slightly hysterical.
sorted output I seek,
All elements unique,
But the results? Far from clinical.
Code in question
use strict;
use warnings;
sub uniq { my %seen; grep ! $seen{$_}++, @_ }
my @test = ();
for ( 1 .. 3 ) {
@test = sort uniq( @test, qw/ d d c c b b a a / );
print "@test\n";
}
Output
d d c c b b a a
d d c c b b a a d d c c b b a a
d d c c b b a a d d c c b b a a d d c c b b a a
The Fix
An extra set of parentheses restores parity:
@test = sort( uniq( @test, qw/ d d c c b b a a / ) ); # a b c d
Running the two lines through -MO=Deparse sheds some light on the effect of the extra parens - it forces the interpreter to treat the RHS as sort LIST instead of sort SUBNAME LIST:
# Doesn't work as intended (sort SUBNAME LIST)
@test = (sort uniq @test, ('d', 'd', 'c', 'c', 'b', 'b', 'a', 'a'));
# Works as intended (sort LIST)
@test = sort(uniq(@test, ('d', 'd', 'c', 'c', 'b', 'b', 'a', 'a')));
My Question
Why is the extra set of parentheses necessary?
uniq returns a list, so I'd expect
sort uniq( @stuff );
to be equivalent to
sort LIST
Although it's rarely used, the first form listed in perldoc -f sort is sort SUBNAME LIST, i.e. the optional first argument to sort is the name of a function to use as the sort comparator. The LIST may or may not be parenthesized, and whitespace is free, so
sort uniq( @test, qw/ d d c c b b a a / )
means to sort the list (@test, qw/ d d c c b b a a /) with the function uniq as a comparator. Since the result of uniq is independent of $a and $b and it has no prototype, it returns the same value on every call (grep in scalar context yields a count, which is 0 when @_ is empty), and sort takes that to mean every pair of elements is equal. sort responds by not changing the order of anything (it has been a stable sort since 5.8 at least).
uniq is treated as a sub name because it's an identifier (or qualified identifier) that isn't also the name of a built-in function. No check is made to see whether the sub actually exists (although it would have been found in this case).
To be disqualified from the sort SUBNAME LIST syntax, sort needs to be followed by a built-in function name, or by something that isn't an identifier or qualified identifier.
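Two spellings that avoid the sort SUBNAME LIST reading are sketched below. The parenthesized form is the one from the question; the explicit comparator block is an alternative added here for illustration, since with a block in place uniq(...) can only be the LIST.

```perl
use strict;
use warnings;

sub uniq { my %seen; return grep { !$seen{$_}++ } @_ }

my @stuff = qw(d d c c b b a a);

# Explicit parentheses: parsed as sort LIST.
my @with_parens = sort( uniq(@stuff) );

# Explicit comparator block: uniq(...) can only be the LIST.
my @with_block = sort { $a cmp $b } uniq(@stuff);

print "@with_parens\n";    # a b c d
print "@with_block\n";     # a b c d
```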

How can I split a word into its component letters?

I am working with Perl and I have an array with only one word:
@example = ("helloword")
I want to generate another array in which each element is a letter from the word:
@example2 = ("h", "e", "l"...)
I need to do that because I need to count the number of "h", "e"... How can I do this?
To count how many times a letter occurs in a string,
print "helloword" =~ tr/h//; # for 'h' letter
Otherwise, you can split the string and assign the resulting list to an array,
my @example2 = split //, $example[0];
I don't completely grasp exactly what you need to count, but perhaps you can take pieces from this example, which uses a hash to store the letters and counts of each...
use warnings;
use strict;
my @array = 'helloworld';
my %letters;
$letters{$_}++ for split //, $array[0];
my $total;
while (my ($k, $v) = each %letters){
$total += $v;
print "$k: $v\n";
}
print "Total letters in string: $total\n";
Output:
w: 1
d: 1
l: 3
o: 2
e: 1
r: 1
h: 1
Total letters in string: 10
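Hash iteration order is not guaranteed in Perl, so the order of the lines above can differ between runs. A variation that sorts the keys gives a repeatable report. This is a sketch; the helper name letter_counts is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref mapping each character to its count.
sub letter_counts {
    my ($word) = @_;
    my %count;
    $count{$_}++ for split //, $word;
    return \%count;
}

my $counts = letter_counts('helloworld');

# Sort the keys so the report comes out in the same order on every run.
print "$_: $counts->{$_}\n" for sort keys %$counts;
```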
Try using this code, found here: http://www.comp.leeds.ac.uk/Perl/split.html
@chars = split(//, $word);
You can of course use split(//,"helloworld"), but that's not as efficient as unpack. Figuring out the template to provide to unpack can be somewhat steep, but this should work for you: unpack('(A)*',"helloworld"). For example:
perl -e 'print(join("\n",unpack("(A)*","helloworld")),"\n")'
h
e
l
l
o
w
o
r
l
d
To count the number of letters, you could either assume that every character of the "word" you split is a letter and simply evaluate the list in scalar context (or use length on the original string), e.g. print(scalar(@letters),"\n"); or print(length($word),"\n"), OR you could create a count variable and increment it in a loop when a letter pattern is matched, e.g.:
my $cnt = 0;
foreach(@chars){$cnt++ if(/\w/)}
print("$cnt\n");
Or you can use the same evaluation of a list in scalar trick with a grep:
print(scalar(grep {/\w/} @chars),"\n");
There are of course, in perl, other ways to do it.
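The counting options can be compared side by side in a short sketch. The sample string is an assumption chosen to include non-letter characters. One pitfall worth noting: length applied to an array does not count its elements; it puts the array in scalar context and measures the resulting number as a string, so use scalar on the array or length on the original string.

```perl
use strict;
use warnings;

my $word  = 'hello, world';
my @chars = split //, $word;

my $n_chars   = scalar @chars;           # every character: 12
my $n_length  = length $word;            # same number, without splitting
my $n_wordish = grep { /\w/ } @chars;    # only \w characters: 10

print "$n_chars $n_length $n_wordish\n"; # 12 12 10
```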
EDIT: In case I misinterpreted the question, and you want to know how many of each letter there is in the string, then this should suffice:
$cnt = 0;
foreach(unpack("(A)*","helloworld"))
{
next unless(/\w/);
# remember the position at which each letter was first seen
$hash->{$_}->{ORD} = $cnt++ unless(exists($hash->{$_}));
$hash->{$_}->{CNT}++;
}
# report the letters in order of first occurrence
foreach(sort {$hash->{$a}->{ORD} <=> $hash->{$b}->{ORD}}
keys(%$hash))
{print("$_\t$hash->{$_}->{CNT}\n")}
This solution has the advantage of keeping the unique letters in the order of their first occurrence in the word they were found in.

How can I rewrite this code as a one-liner (or in fewer lines, for the command line) in Perl?

I have code like this:
#!/usr/bin/perl
use strict;
use warnings;
my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;
open(INPUT,"<dna.txt");
while (<INPUT>) {
tr/[a,c,g,t]/[A,C,G,T]/;
y/GCTA/CGAU/;
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
}
}
close(INPUT);
This code is related to my other question's answer: DNA to RNA and Getting Proteins with Perl
The output of the program is:
SIMQNISGREAT
How can I rewrite this code in Perl so that it runs from the command line with less code (a one-liner, if possible)?
PS 1: dna.txt looks like this:
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
PS 2: If it makes the code shorter, it is acceptable to move the my %proteins variable into a file.
The only change I would recommend is simplifying your while loop:
while (<INPUT>) {
tr/acgt/ACGT/;
tr/GCTA/CGAU/;
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
}
}
Since y and tr are synonyms, you should only use one of them. I think tr reads better than y, so I picked tr. Further, you were calling them very differently, but this should have the same effect and only mentions the letters you actually change. (All the other characters were being transliterated to themselves. That makes it much harder to see what is actually being changed.)
You might want to remove the open(INPUT,"<dna.txt"); and corresponding close(INPUT); lines, as they make it much harder to use your program in shell pipelines or with different input files. But that's up to you: if the input file will always be dna.txt and never anything different, this is alright.
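The claim that the bracketed and the plain spelling of tr change a string in the same way can be checked with a small sketch. The two helper names are made up for illustration. In tr, square brackets and commas are ordinary characters, and in the original spelling each of them maps to itself.

```perl
use strict;
use warnings;

# The original spelling: brackets and commas are literal characters in tr.
sub upcase_bracketed {
    my ($s) = @_;
    (my $t = $s) =~ tr/[a,c,g,t]/[A,C,G,T]/;
    return $t;
}

# The simplified spelling.
sub upcase_plain {
    my ($s) = @_;
    (my $t = $s) =~ tr/acgt/ACGT/;
    return $t;
}

# Print the input next to both results; the last two columns should agree.
for my $in ('tcataa', 'a,c[g]t n') {
    printf "%-10s %-10s %s\n", $in, upcase_bracketed($in), upcase_plain($in);
}
```

The resulting strings match, but the return values of the two tr operations would not: tr returns the number of characters it matched, and the bracketed form also counts brackets and commas.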
Somebody (@kamaci) called my name in another thread. This is the best I can come up with while keeping the protein table on the command line:
perl -nE'say+map+substr("FYVDINLHL%VEMKLQL%VEIKLQFYVDINLHCSGASTRPWSGARTRP%SGARTRPCSGASTR",(s/GGG/GGC/i,vec($_,0,32)&101058048)%63,1),/.../g' dna.txt
(This uses Unix shell quoting; for Windows, swap the ' and " characters.) This version marks invalid codons with %; you can probably fix that by adding =~y/%//d at an appropriate spot.
Hint: This picks out 6 bits from the raw ASCII encoding of an RNA triple, giving 64 codes between 0 and 101058048; to get a string index, I reduce the result modulo 63, but this creates one double mapping which regrettably had to code two different proteins. The s/GGG/GGC/i maps one of them to another that codes the right protein.
Also note the parentheses before the % operator which both isolate the , operator from the argument list of substr and fix the precedence of & vs %. If you ever use that in production code, you're a bad, bad person.
#!/usr/bin/perl
%p=qw/UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G/;
$_=uc<DATA>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
__DATA__
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
Phew. Best I can come up with, at least this quickly. If you're sure the input is always already in uppercase, you can also drop the uc saving another two characters. Or if the input is always the same, you could assign it to $_ straight away instead of reading it from anywhere.
I guess I don't need to say that this code should not be used in production environments or anywhere else other than pure fun. When doing actual programming, readability almost always wins over compactness.
A few other versions I mentioned in the comments:
Reading %p and the DNA from files:
#!/usr/bin/perl
open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;
open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
From shell with perl -e:
perl -e 'open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g'
Most things have already been pointed out, especially that readability matters. I wouldn't try to reduce the program more than what follows.
use strict;
use warnings;
# http://stackoverflow.com/questions/5402405/
my $fnprot = shift || 'proteins.txt';
my $fndna = shift || 'dna.txt';
# build protein table
open my $fhprot, '<', $fnprot or die "open $fnprot: $!";
my %proteins = split /\s+/, do { local $/; <$fhprot> };
close $fhprot;
# process dna data
my @result;
open my $fhdna, '<', $fndna or die "open $fndna: $!";
while (<$fhdna>) {
tr/acgt/ACGT/;
tr/GCTA/CGAU/;
push @result, map $proteins{$_}, grep defined $proteins{$_}, m/(...)/g;
}
close $fhdna;
# check correctness of result (given input as per original post)
my $expected = 'SIMQNISGREAT';
my $got = join '', @result;
die "@result is not expected" if $got ne $expected;
print "@result - $got\n";
The only "one-liner" thing I added is the push map grep m//g in the while loop. Note that Perl 5.10 adds the "defined or" operator - // - which allows you to write:
push @result, map $proteins{$_} // (), m/(...)/g;
Ah okay, the open do local $/ file slurp idiom is handy for slurping small files into memory. Hope you find it a bit inspiring. :-)
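The slurp idiom and the defined-or operator can be tried without any files on disk. In this sketch the protein table is read from an in-memory filehandle standing in for proteins.txt, and it contains only the handful of codons the demo needs; both are assumptions made for illustration, as is the helper name translate. It uses split ' ' rather than split /\s+/ so that leading whitespace in the table cannot produce an empty first field.

```perl
use strict;
use warnings;

# Hypothetical stand-in for proteins.txt: only the codons this demo needs.
my $table_text = "AGU S AUU I AUG M CAA Q\nAAC N AUA I AGC S GGU G\n";

# Slurp idiom: with $/ undefined, one read returns the whole file.
open my $fh, '<', \$table_text or die "open in-memory table: $!";
my %proteins = split ' ', do { local $/; <$fh> };
close $fh;

# Hypothetical helper: translate one DNA line, skipping unknown codons.
sub translate {
    my ($dna) = @_;
    (my $rna = uc $dna) =~ tr/GCTA/CGAU/;
    return join '', map { my $aa = $proteins{$_}; $aa // () } $rna =~ /(...)/g;
}

print translate('TCATAATACGTTTTGATT'), "\n";    # SIMQN (UAA is not in the table)
```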
If you write the protein data to another file, space-delimited and without line breaks, you can import it with a single read of that file.
#!/usr/bin/perl
use strict;
use warnings;
open(INPUT, "<mydata.txt");
open(DATA, "<proteins.txt");
my %proteins = split(" ",<DATA>);
while (<INPUT>) {
tr/GCTA/CGAU/;
while(/(\w{3})/gi) {print $proteins{$1} if (exists($proteins{$1}))};
}
close(INPUT);
close(DATA);
You can remove the line "tr/a,c,g,t/A,C,G,T/" if the input is already uppercase. Note that the i option on the match does not replace it: tr/GCTA/CGAU/ and the hash lookup are both case-sensitive, so lowercase input would produce no output. The original foreach loop can be condensed as in the code above. The $1 variable holds the text matched inside the parentheses of /(\w{3})/gi.

How can I distinguish $_ in nested list operators in Perl?

It is often useful to implement algorithms using nested array operations. For example, to find the number of words in a list that start with each given character, you might do something like this in Python:
>>> a = ["foo","bar","baz"]
>>> map(lambda c: len(filter(lambda w: w.startswith(c), a)), ('a','b','c','d','e','f'))
[0, 2, 0, 0, 0, 1]
In the expression w.startswith(c) it is trivial to distinguish between the two loop iteration variables w and c because they have different names.
In Perl, I would like to do something like this:
@a = ("foo", "bar", "baz");
map length(grep $_ =~ /^$_/, @a), ('a','b','c','d','e','f')
However, the obvious problem with this is that $_ refers only to the innermost grep iteration variable (suitable for the $_ on the left), not the one for the outer map (suitable for the /^$_/). What is the idiomatic way to avoid this problem in Perl?
Just assign the outer $_ to a lexical variable:
@a = qw(foo bar baz);
map {my $ch = $_; scalar grep $_ =~ /^$ch/, @a} qw(a b c d e f)
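A strict-safe version of the same idea is sketched below. The helper name count_by_prefix is made up for illustration, and \Q...\E is added so that a prefix containing regex metacharacters is matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: for each prefix, count the words that start with it.
sub count_by_prefix {
    my ($words, @prefixes) = @_;
    return map {
        my $prefix = $_;                           # capture the outer $_
        scalar grep { /^\Q$prefix\E/ } @$words;    # inner $_ is each word
    } @prefixes;
}

my @a      = qw(foo bar baz);
my @counts = count_by_prefix(\@a, 'a' .. 'f');
print "@counts\n";    # 0 2 0 0 0 1
```

The result matches the Python output in the question: two words start with b, one with f, and none with the other letters.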