Use keys and pairing elements in Perl

My data looks like this:
G1 G2 G3 G4
Pf1 NO B1 NO D1
Pf2 NO NO C1 D1
Pf3 A1 B1 NO D1
Pf4 A1 NO C1 D2
Pf5 A3 B2 C2 D3
Pf6 NO B3 NO D3
My goal is to check, in each column, whether an element (other than "NO") appears exactly twice (like A1 in column 2, for example); elements that appear three or more times should not be in the output. For each such element, I want to write the corresponding elements of the first column. So the desired output looks like this:
Pf3 Pf4 A1
Pf1 Pf3 B1
Pf2 Pf4 C1
Pf5 Pf6 D3
I'm trying to write a perl script, but I need some help to focus on the different steps. This is what I did so far:
open (HAN, "< $file_in") || die "Cannot open $file_in: $!";
@r = <HAN>;
close (HAN);
for ($i=0; $i<=$#r; $i++){
    chomp ($r[$i]);
    ($Ids, @v) = split (/\t/, $r[$i]);
}
But I cannot go on in any direction!
(My perl knowledge needs to be pushed by you!)
The hot points in my mind are:
how do I compare elements within the same column (or anywhere in the same file)?
how can I associate the elements of the first column with those of the other columns (maybe as hash keys)?
Any help is absolutely necessary and welcome!

use Data::Dumper;
my %hash;
while (<DATA>) {
next if $.==1;
chomp;
my ($first,@others) = (split /\s+/);
for (@others){
$hash{$_}.=' '.$first;
}
}
print Dumper \%hash;
__DATA__
G1 G2 G3 G4
Pf1 NO B1 NO D1
Pf2 NO NO C1 D1
Pf3 A1 B1 NO D1
Pf4 A1 NO C1 D2
Pf5 A3 B2 C2 D3
Pf6 NO B3 NO D3
What did I use here? (the tricks)
while (<DATA>) {BLOCK} - reads data from the special DATA section at the end of the Perl script file. (Yes, you can put test data there if you want. But don't store everything! It is not a bin!)
next if $.==1 - $. is a special variable that stores the current line number of the input, like an index. This skips the header line.
chomp; - to explain this one we need to go back to while (<DATA>).
Some variables in Perl are implicit. In functions, @_ is the array of input parameters. And Perl programmers always like to use $_, the default variable.
while (<DATA>) is really a hidden while (defined($_ = <DATA>)).
The chomp function works on the default variable $_ and removes the trailing \n, if there is one.
split /REGEX/ also works on the default variable ($_) when no string is given.
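The hash dumped above maps each value to the rows that contain it, but it does not yet keep the columns apart or filter for exactly two occurrences. Here is a minimal sketch of those remaining steps, assuming whitespace-separated input and that "NO" marks an empty cell. The sub name pairs_per_column is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "ID ID VALUE" line for every value that
# occurs exactly twice within a single column ("NO" cells are ignored).
sub pairs_per_column {
    my @rows = @_;
    my @seen;    # $seen[$col]{$value} = [ IDs of the rows holding $value ]

    for my $row (@rows) {
        my ($id, @values) = split ' ', $row;
        for my $col (0 .. $#values) {
            next if $values[$col] eq 'NO';
            push @{ $seen[$col]{ $values[$col] } }, $id;
        }
    }

    my @out;
    for my $column (@seen) {
        next unless $column;    # a column that held only "NO"
        for my $value (sort keys %$column) {
            my @ids = @{ $column->{$value} };
            push @out, "@ids $value" if @ids == 2;
        }
    }
    return @out;
}

# Data rows from the question, header line omitted.
my @rows = (
    'Pf1 NO B1 NO D1',
    'Pf2 NO NO C1 D1',
    'Pf3 A1 B1 NO D1',
    'Pf4 A1 NO C1 D2',
    'Pf5 A3 B2 C2 D3',
    'Pf6 NO B3 NO D3',
);
print "$_\n" for pairs_per_column(@rows);
```

Keeping one hash per column answers the "same column" part of the question: D1 shows up three times in the last column and is dropped, while A1, B1, C1 and D3 each collect exactly two row IDs.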

Perl multi liner :),
perl -anE '
/^\S/ or next;
$k = shift @F;
push @{$t{$_}}, $k for@F;
}{
@$_-1==2 and say join" ",@$_ for map [@{$t{$_}},$_], sort keys%t;
' file

Related

Why can't I delete the empty row?

The file A contains one row like below,
SNP1 AA TT GGSNP2 GG CC AASNP3 GG CC AA...
I want to change the format and make it like this,
SNP1 TT AA TT GG CC AA
SNP2 AA GG CC TT GG CC
SNP3 GG AA TT TT CC TT
...
(each row begins with SNP_)
I wrote a Perl script to replace SNP with \nSNP, but the first row of the output file was always empty, even though I tried several ways to delete it.
So, is there any suggestion for me? Or is there another way to get the final output file?
Thanks a lot.
Just make sure there is a character before every SNP that you alter:
s/.\K(?=SNP)/\n/g
open $file_in...
open $file_out...
local $/ = 'SNP';
<$file_in>; # discard empty row
while ( my $record = <$file_in> ) {
chomp $record;
# make changes to $record, which will start as e.g. "1 AA TT GG"
....
print $file_out "SNP$record\n";
}
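A runnable sketch of the same record-separator idea, using an in-memory filehandle instead of real files. The sample string and the sub name split_records are assumptions made for illustration; the sketch also assumes the data starts with "SNP", so the first record is empty.

```perl
use strict;
use warnings;

# Hypothetical helper: splits one long line into records that start with "SNP".
sub split_records {
    my ($text) = @_;
    open my $in, '<', \$text or die "Cannot open in-memory file: $!";

    local $/ = 'SNP';            # read up to and including each "SNP"
    my $leading = <$in>;         # discard the empty record before the first "SNP"

    my @records;
    while ( my $record = <$in> ) {
        chomp $record;           # removes the trailing "SNP" (the current $/)
        $record =~ s/\s+\z//;    # the last record ends in a newline instead
        push @records, "SNP$record";
    }
    close $in;
    return @records;
}

print "$_\n" for split_records("SNP1 AA TT GGSNP2 GG CC AASNP3 GG CC AA\n");
```

Because chomp removes whatever $/ currently holds, each record comes back without the separator, and putting "SNP" back in front restores the row label.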

Perl: read file and re-arrange into columns

I have a file that I want to read, which has the following structure:
EDIT: I made the example a bit more specific to clarify what I need.
HEADER
MORE HEADER
POINTS 2
x1 y1 z1
x2 y2 z2
VECTORS velocities
u1 v1 w1
u2 v2 w2
VECTORS displacements
a1 b1 c1
a2 b2 c2
The number of blocks containing data is arbitrary, and so is their order.
I want to read only the data under "POINTS" and under "VECTORS displacements" and rearrange them in the following format:
x1 y1 z1 a1 b1 c1
x2 y2 z2 a2 b2 c2
I managed to read the xyz and abc blocks into separate arrays, but my problem is combining them into one.
I should mention that I am a Perl newbie. Can somebody help me?
This is made very simple using the range operator. The expression
/DATA-TO-READ/ .. /DATA-NOT-TO-READ/
evaluates to 1 on the first line of the range (the DATA-TO-READ line), 2 on the second etc. On the last line (the DATA-NOT-TO-READ line) E0 is appended to the count so that it evaluates to the same numeric value but can also be tested for being the last line. On lines outside the range it evaluates to a false value.
This program accumulates the data in array @output and prints it when the end of the input is reached. It expects the path to the input file as a parameter on the command line.
use strict;
use warnings;
my (@output, $i);
while (<>) {
    my $index = /DATA-TO-READ/ .. /DATA-NOT-TO-READ/;
    if ($index and $index > 1 and $index !~ /E/) {
        push @{ $output[$index-2] }, split;
    }
}
print "@$_\n" for @output;
output
x1 y1 z1 a1 b1 c1
x2 y2 z2 a2 b2 c2
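To make the behaviour of the range operator more concrete, here is a small sketch that records its value for every line of a made-up input. The sub name range_values and the sample lines are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the value of the range operator for each line.
sub range_values {
    my @values;
    for (@_) {
        # In scalar context ".." acts as a flip-flop between the two matches.
        my $index = /DATA-TO-READ/ .. /DATA-NOT-TO-READ/;
        push @values, $index;
    }
    return @values;
}

my @lines  = ('HEADER', 'DATA-TO-READ', 'x1 y1 z1', 'x2 y2 z2',
              'DATA-NOT-TO-READ', 'TRAILER');
my @values = range_values(@lines);
printf "%-18s '%s'\n", $lines[$_], $values[$_] for 0 .. $#lines;
```

The values come out as an empty string for HEADER and TRAILER, 1 for the DATA-TO-READ line, 2 and 3 for the two data lines, and 4E0 for the DATA-NOT-TO-READ line. That is why the program above keeps only lines where the value is greater than 1 and contains no E.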
I only used 1 array to remember the first 3 columns. You can output directly when processing the second part of the data.
#!/usr/bin/perl
use strict;
use warnings;
my @first; # To store the first 3 columns.
my $reading; # Flag: are we reading the data?
while (<>) {
    next unless $reading or /DATA-TO-READ/; # Skip the header.
    $reading = 1, next unless $reading; # Skip the DATA-TO-READ line, enter the
                                        # reading mode.
    last if /DATA-NOT-TO-READ/; # End of the first part.
    chomp; # Remove a newline.
    push @first, $_; # Remember the line.
}
undef $reading; # Reset the flag.
while (<>) {
    next unless $reading or /DATA-TO-READ/;
    $reading = 1, next unless $reading;
    last if /DATA-NOT-TO-READ/;
    print shift @first, " $_"; # Print the remembered columns + current line.
}

How to append new values to hash in Perl

I have a hash of hashes where, at the last level, I want each value to be appended - not updated - if that value already exists. What would be the best way to do this? I was thinking about making the values as lists, but this is probably not the most efficient way...
Here's where I got so far:
#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $dir='D:';
open my $data,"<","$dir\\file.txt" or die "Cannot open $dir\\file.txt: $!";
my (@selecteddata,@array,%hash);
while (<$data>) {
    chomp $_;
    my @line= split "\t";
    $hash{$line[1]}{$line[2]}=$line[0];
    warn Dumper \%hash;
}
close $data;
Note: this code overwrites the value at the last level with $line[0], but if the key $line[2] already exists (meaning it already holds a previous $line[0] value), I want the new value to be appended rather than replace the old one.
So, ideally, for the following (tab separated) list:
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b4 c4
The hash would look something like this - I don't know exactly how the grouping of a4 and a5 should look like, so as long as they are grouped it should be ok:
{
'b1' => {'c1' => 'a1'},
'b2' => {'c2' => 'a2'},
'b3' => {'c3' => 'a3'},
'b4' => {'c4' => 'a4, a5'}
}
You can append to your string,
$_ = defined($_) ? "$_, $line[0]" : $line[0]
for $hash{$line[1]}{$line[2]};
or use an array, which is better suited for storing a list of elements,
push @{ $hash{$line[1]}{$line[2]} }, $line[0];
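As a quick check of the array variant, this sketch runs it over the sample rows from the question. The sub name group_rows is invented for the example, and the rows are given inline rather than read from a file.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: groups column 1 by column 2 and column 3.
sub group_rows {
    my %hash;
    for my $row (@_) {
        my @line = split /\t/, $row;
        # Autovivification creates the inner hash and the array on first use,
        # so no "exists" check is needed before pushing.
        push @{ $hash{ $line[1] }{ $line[2] } }, $line[0];
    }
    return \%hash;
}

# Tab-separated sample rows from the question.
my $grouped = group_rows(
    "a1\tb1\tc1", "a2\tb2\tc2", "a3\tb3\tc3", "a4\tb4\tc4", "a5\tb4\tc4",
);

print join(', ', @{ $grouped->{b4}{c4} }), "\n";

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper $grouped;
```

The first print shows "a4, a5" for the b4/c4 entry. Storing an array reference keeps the individual values intact, and they can still be joined into a string whenever one is needed.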

Sed, awk, Perl or other for de-interleaving text file

I would like a relatively compact command to perform line-by-line de-interleaving of a text file, i.e
a1
a2
a3
a4
b1
b2
b3
b4
c1
c2
c3
c4
d1
d2
d3
d4
maps to
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
The interleaving depth should be adjustable. The lines themselves do not contain any useful structure to assist with the process, and the example above is just a toy example for demonstration purposes. What tool can I use to do this?
sort can do it!
$ sort -k1.2 your_file
-k1.2 sorts by first field starting from 2nd character.
Output:
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
Basically, what you're looking at doing is reading your data into a 2D array. As you read it in, you can (for example) put the data into the array row by row.
Then when you write the data out, you traverse the array column by column. Adjusting the (de-)interleaving you do just requires a different size of array (or at least that you use a different amount of it, though you could leave the array size itself fixed, if you chose).
According to your new requirements, reordering elements based on their position in the file:
use strict;
use warnings;
my @sorted;
my $depth = 4; # the adjustable interleaving depth
while (<DATA>) {
    my $num = ($. - 1) % $depth; # $. is the input line number (1-based)
    push @{ $sorted[$num] }, $_;
}
for (@sorted) {
    print @$_;
}
__DATA__
a1
a2
a3
a4
b1
b2
b3
b4
c1
c2
c3
c4
d1
d2
d3
d4
Note that the script can be tested on an input file by changing <DATA> to <> and running:
perl script.pl input.txt
Update
Having finally understood your question, thanks to TLP, I suggest this solution. It expects the depth and the input file name on the command line:
$ perl deinter.pl 4 interleaved.txt
and prints the reordered data to STDOUT.
use strict;
use warnings;
my $depth = shift;
my @data = <>;
for my $start (0 .. $depth-1) {
    for (my $i = $start; $i < @data; $i += $depth) {
        print $data[$i];
    }
}
output
a1
b1
c1
d1
a2
b2
c2
d2
a3
b3
c3
d3
a4
b4
c4
d4
Previous solution
Here is a technique that reads the whole file into memory, builds a set of keys for comparison, and sorts the indices of the data so that they can be printed in the new order.
It can be changed for your purposes by modifying the regular expression that extracts the keys fields, and changing the sort block so that the sorted order is correct.
If your file is enormous then it may be necessary to build only the array of keys in memory, and leave the rest of the data on file to be read as it is output.
use strict;
use warnings;
open my $fh, '<', 'interleaved.txt' or die $!;
my @data = <$fh>;
my @keys = map [ /^(.)(.)/ ], @data;
my @sorted = sort {
    $keys[$a][1] <=> $keys[$b][1] or
    $keys[$a][0] cmp $keys[$b][0]
} 0 .. $#keys;
print $data[$_] for @sorted;
This might work for you (GNU sed and sort):
sed '1{x;s/^/1/;x};G;s/\n/\t/p;x;y/1234/2341/;x;d' file|sort -sk2|sed 's/\t.*//'
I'd like to credit Borodin and TLP for their inputs and answers, which inspired the solution. It's ugly, but I like it:
awk 'BEGIN{v=4}{now=(NR-1)%v; STOR[now] = STOR[now] "\n" $0;} END {for (i=0; i<v; i++) print STOR[i]}'
It also has the flaw of printing spurious newlines (well, the ones appended to the start of the array), but I can deal with that.
EDIT:
Solution for the newlines:
awk 'BEGIN{v=4}{now=(NR-1)%v; STOR[now] = STOR[now] "\n" $0;} END {for (i=0; i<v; i++) print substr(STOR[i],2)}'

How to rewrite this code as a one-liner (or in fewer lines on the command line) in Perl?

I have code like this:
#!/usr/bin/perl
use strict;
use warnings;
my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;
open(INPUT,"<dna.txt");
while (<INPUT>) {
tr/[a,c,g,t]/[A,C,G,T]/;
y/GCTA/CGAU/;
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
}
}
close(INPUT);
This code is related to my other question's answer: DNA to RNA and Getting Proteins with Perl
The output of the program is:
SIMQNISGREAT
How can I rewrite this code in Perl so that it runs on the command line with less code (a one-liner, if possible)?
PS 1: dna.txt looks like this:
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
PS 2: If it makes the code shorter, it is acceptable to put the %proteins table into a file.
The only changes I would recommend making are simplifying your while loop:
while (<INPUT>) {
tr/acgt/ACGT/;
tr/GCTA/CGAU/;
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
}
}
Since y and tr are synonyms, you should only use one of them. I think tr reads better than y, so I picked tr. Further, you were calling them very differently, but this should be the same effect and only mentions the letters you actually change. (All the other characters were being transposed to themselves. That makes it much harder to see what is actually being changed.)
You might want to remove the open(INPUT,"<dna.txt"); and corresponding close(INPUT); lines, as they make it much harder to use your program in shell pipelines or with different input files. But that's up to you, if the input file will always be dna.txt and never anything different, this is alright.
Somebody (@kamaci) called my name in another thread. This is the best I can come up with while keeping the protein table on the command line:
perl -nE'say+map+substr("FYVDINLHL%VEMKLQL%VEIKLQFYVDINLHCSGASTRPWSGARTRP%SGARTRPCSGASTR",(s/GGG/GGC/i,vec($_,0,32)&101058048)%63,1),/.../g' dna.txt
(Shell quoting, for Windows quoting swap ' and " characters). This version marks invalid codons with %, you can probably fix that by adding =~y/%//d at an appropriate spot.
Hint: This picks out 6 bits from the raw ASCII encoding of an RNA triple, giving 64 codes between 0 and 101058048; to get a string index, I reduce the result modulo 63, but this creates one double mapping which regrettably had to code two different proteins. The s/GGG/GGC/i maps one of them to another that codes the right protein.
Also note the parentheses before the % operator which both isolate the , operator from the argument list of substr and fix the precedence of & vs %. If you ever use that in production code, you're a bad, bad person.
#!/usr/bin/perl
%p=qw/UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G/;
$_=uc<DATA>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
__DATA__
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
Phew. Best I can come up with, at least this quickly. If you're sure the input is always already in uppercase, you can also drop the uc saving another two characters. Or if the input is always the same, you could assign it to $_ straight away instead of reading it from anywhere.
I guess I don't need to say that this code should not be used in production environments or anywhere else other than pure fun. When doing actual programming, readability almost always wins over compactness.
A few other versions I mentioned in the comments:
Reading %p and the DNA from files:
#!/usr/bin/perl
open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;
open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
From shell with perl -e:
perl -e 'open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g'
Most things have already been pointed out, especially that readability matters. I wouldn't try to reduce the program more than what follows.
use strict;
use warnings;
# http://stackoverflow.com/questions/5402405/
my $fnprot = shift || 'proteins.txt';
my $fndna = shift || 'dna.txt';
# build protein table
open my $fhprot, '<', $fnprot or die "open $fnprot: $!";
my %proteins = split /\s+/, do { local $/; <$fhprot> };
close $fhprot;
# process dna data
my @result;
open my $fhdna, '<', $fndna or die "open $fndna: $!";
while (<$fhdna>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    push @result, map $proteins{$_}, grep defined $proteins{$_}, m/(...)/g;
}
close $fhdna;
# check correctness of result (given input as per original post)
my $expected = 'SIMQNISGREAT';
my $got = join '', @result;
die "@result is not expected" if $got ne $expected;
print "@result - $got\n";
The only "one-liner" thing I added is the push map grep m//g in the while loop. Note that Perl 5.10 adds the "defined or" operator - // - which allows you to write:
push @result, map $proteins{$_} // (), m/(...)/g;
Ah okay, the open do local $/ file slurp idiom is handy for slurping small files into memory. Hope you find it a bit inspiring. :-)
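For a self-contained way to try the "defined or" variant, here is a sketch that keeps the codon table inline and wraps the translation in a sub. It needs Perl 5.10 or later; the sub name translate is an assumption made for the example, and the table is the one from the question.

```perl
use strict;
use warnings;

# Codon table copied from the question.
my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;

# Hypothetical helper: complements the DNA, switches T to U and maps codons.
sub translate {
    my ($dna) = @_;
    (my $rna = uc $dna) =~ tr/GCTA/CGAU/;
    my @codons = $rna =~ /(...)/g;
    # "// ()" yields an empty list for codons missing from the table,
    # so they are skipped instead of producing undef.
    return join '', map { $proteins{$_} // () } @codons;
}

print translate('TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT'), "\n";
```

With the sample line from dna.txt this prints SIMQNISGREAT, the same result as the original program.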
If you write the protein data to another file, space-delimited and without line breaks, you can import it by reading the file just once.
#!/usr/bin/perl
use strict;
use warnings;
open(INPUT, "<mydata.txt");
open(DATA, "<proteins.txt");
my %proteins = split(" ",<DATA>);
while (<INPUT>) {
tr/GCTA/CGAU/;
while(/(\w{3})/gi) {print $proteins{$1} if (exists($proteins{$1}))};
}
close(INPUT);
close(DATA);
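You can remove the line tr/a,c,g,t/A,C,G,T/ if the input is already uppercase. Note that the i option on the match does not upper-case anything: \w{3} matches letters of either case anyway, so lowercase input would still have to be upper-cased before tr/GCTA/CGAU/ and the %proteins lookup can work. The original foreach loop can be replaced by the while loop above, where $1 holds the text captured by the parentheses in /(\w{3})/gi.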