This seems like a basic thing to do, but I can't figure out a simple way of doing it without building lots of arrays etc., so my apologies if this is too simple.
I have a file of this format:
a,x1
a,x2
a,x3
b,x4
c,x5
c,x6
This is an edge list for a very big graph.
I need to convert it to the following format:
a,x1 x2 x3
b,x4
c,x5 x6
(this is another common format of graphs)
Is there a simple way of doing that in Perl? You can assume that all the "a" and "b" are sorted, so once you get to a new starting node (say "b") there will be no going back (i.e. no more edges outgoing from "a").
Any advice would be appreciated.
Just keep the last "from" node in a variable that survives iterations of the loop.
#!/usr/bin/perl
use warnings;
use strict;
my $last = q();
while (<>) {
    chomp;
    my ($from, $to) = split /,/;
    if ($from ne $last) {
        print "\n" x (1 != $.), $from, ',';
        $last = $from;
    } else {
        print ' ';
    }
    print $to;
}
print "\n";
"\n" x (1 != $.) prevents the newline being printed before the 1st line.
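The repetition trick can be looked at in isolation. The following is only a sketch: separator_before is a hypothetical helper that is not part of the answer, and the line number is passed in by hand instead of coming from $. while reading a file.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original answer): the separator to print
# before a record, given its 1-based line number.
sub separator_before {
    my ($line_number) = @_;
    # (1 != $line_number) is 1 when true and a zero-like false value otherwise,
    # so "\n" is repeated once or not at all.
    return "\n" x ( 1 != $line_number );
}

# Records are separated by newlines, with no newline before the first one.
for my $n ( 1 .. 3 ) {
    print separator_before($n), "record $n";
}
print "\n";
```

The same idea works with any string on the left of x, for example a comma that should only appear between items.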
The same as a one-liner:
perl -aF, -ne 'chomp $F[1]; print "\n" x (1 != $.), "$F[0]," if $l ne $F[0];
print " " x ($l eq $F[0]), $F[1]; $l = $F[0] }{ print "\n"' < input
I'd do this in two stages. One to get the data into a useful data structure and another to print the data.
The data structure I have chosen is a hash of arrays. The keys of the hash are your 'a', 'b' and 'c' and the values are references to arrays containing the 'x1', 'x2', etc.
The code looks like this:
#!/usr/bin/perl
use strict;
use warnings;
# We use modern Perl (specifically say())
use 5.010;
my %edges;
while (<DATA>) {
    chomp;
    my ($key, $val) = split /,/;
    push @{$edges{$key}}, $val;
}
for (sort keys %edges) {
    # comma after the key, spaces between the values: "a,x1 x2 x3"
    say join ',', $_, join ' ', @{$edges{$_}};
}
__DATA__
a,x1
a,x2
a,x3
b,x4
c,x5
c,x6
Here is the script of user Suic for calculating molecular weight of fasta sequences (calculating molecular weight in perl),
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (@ARGV) {
    open my $fh, '<:encoding(UTF-8)', $file;
    my $input = join q{}, <$fh>;
    close $fh;
    while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
        my $name = $1;
        my $seq = $2;
        $seq =~ s/\n//smxg;
        my $mass = calc_mass($seq);
        print "$name has mass $mass\n";
    }
}
sub calc_mass {
    my $a = shift;
    my @a = ();
    my $x = length $a;
    @a = split q{}, $a;
    my $b = 0;
    my %data = (
        A=>71.09, R=>16.19, D=>114.11, N=>115.09,
        C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
        H=>137.14, I=>113.16, L=>113.16, K=>128.17,
        M=>131.19, F=>147.18, P=>97.12, S=>87.08,
        T=>101.11, W=>186.12, Y=>163.18, V=>99.14
    );
    for my $i( @a ) {
        $b += $data{$i};
    }
    my $c = $b - (18 * ($x - 1));
    return $c;
}
and the protein.fasta file with n (here is 2) sequences:
>seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When I run perl molecular_weight.pl protein.fasta > output.txt
in a terminal, it generates the correct results, but it also prints the warning "Use of uninitialized value in addition (+) at molecular_weight.pl line 36", which points at the line "$b += $data{$i};". How can I fix this bug? Thanks in advance!
You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)
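The "match and collect" alternative mentioned above might look roughly like the sketch below. The three-letter symbol table, $Symbol_regex and collect_symbols() are assumptions made up for illustration; they do not appear in the original script.

```perl
use strict;
use warnings;

# Hypothetical table with multi-character symbols (illustration only).
my %Data = ( Ala => 71.09, Gly => 57.05, Ser => 87.08 );

# Build an alternation from the hash keys. Longest symbols go first so that
# a longer symbol is preferred over a shorter one that is a prefix of it.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %Data;
my $Symbol_regex = qr/($alternation)/;

# Return the known symbols found in $seq, in order; anything else
# (whitespace, invisible characters, typos) is simply never matched.
sub collect_symbols {
    my ($seq) = @_;
    my @found = $seq =~ /$Symbol_regex/g;
    return @found;
}

my @symbols = collect_symbols("AlaGly Ser\nxxAla");
print "@symbols\n";
```

The mass can then be summed over the collected list, and since every collected item is a key of %Data, the lookup cannot produce an undefined value.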
The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed in with the numbers. I solved this by substituting the value like so: s/%/ %/g for @data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
#ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an @ or $
in a double-quoted string immediately before interpolating a variable,
for example in "user @$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal @
symbol followed by the contents of $twitter_id: "user \@$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my @data;
while ( <$DATA> ) {
    @data = (<$DATA>);
}
s/%/ %/g for @data;
s/---/000/ for @data;
print @data;
my @sorted = sort { $b->[5] <=> $a->[5] } @data;
print @sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
    my $keya = $a =~ /(\d+)%/ ? $1 : '0';
    my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
    $keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6, it would have created 7 columns, because the last field, 'Mounted on', also contains a space and would have been incorrectly split into 2 columns.
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above 'loads' the data into the table. The map applies a transformation to each line from <$DATA>. Each line is split into an anonymous array (created by [....]). The split is on 2 or more whitespace characters, \s{2,}. If that wasn't specified, then the data 'snap reserve' with 1 space would have been incorrectly split.
I hope this makes what's going on more clear.
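The effect of the two split calls can be checked on their own. This is a small sketch; the sample strings are modelled on the question's file, and the exact column spacing in them is an assumption.

```perl
use strict;
use warnings;

# Sample header and data line; the spacing between columns is assumed.
my $hdr  = "Filesystem   total   used   avail   capacity   Mounted on";
my $line = "snap reserve   0TB   0TB   0TB   ---%   /vol/x/..";

my @header_fields = split ' ', $hdr, 6;      # limit of 6 keeps 'Mounted on' together
my @naive_fields  = split ' ', $line;        # any whitespace splits 'snap reserve' apart
my @wide_fields   = split /\s{2,}/, $line;   # only runs of 2+ whitespace separate fields

print scalar(@header_fields), " header fields, last is '$header_fields[-1]'\n";
print scalar(@naive_fields),  " fields when splitting on any whitespace\n";
print scalar(@wide_fields),   " fields when splitting on 2+ spaces, first is '$wide_fields[0]'\n";
```

Splitting on two or more spaces only works while the columns really are separated by at least two spaces, so it depends on how the report was generated.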
And a simpler example that doesn't align the columns like Text::Table, but leaves them in the form they originally were read might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
    my $keya = $a =~ /(\d+)%/ ? $1 : '0';
    my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
    $keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my @sorted = sort { $b->[5] <=> $a->[5] } @data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array @data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that
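One possible shape for that extraction step is sketched below. It assumes that the capacity is the only number directly followed by % on each line, and that lines without such a number (like ---%) should sort as 0. The helper names percent_of and sort_by_percent_desc and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Pull the number in front of '%' out of a line; lines without one count as 0.
sub percent_of {
    my ($line) = @_;
    return $line =~ /(\d+)%/ ? $1 : 0;
}

# Sort lines by that number, highest first. Each line is paired with its key
# once, the pairs are sorted, and the original lines are unpacked again.
sub sort_by_percent_desc {
    my @lines = @_;
    return map  { $_->[1] }
           sort { $b->[0] <=> $a->[0] }
           map  { [ percent_of($_), $_ ] } @lines;
}

my @sample = (
    "/vol/a/  1638GB  735GB  903GB  45%  /vol/a/\n",
    "snap reserve  0TB  0TB  0TB  ---%  /vol/b/..\n",
    "/vol/b/  2048GB  1653GB  394GB  81%  /vol/b/\n",
);
print sort_by_percent_desc(@sample);
```

Here $a and $b inside the sort block really are array references, which is what the failing line in the question assumed about the raw lines.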
Your code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my @data;
while( my $d = <$fh> ) {
    chomp $d;
    my @fields = split '\s{2,}', $d;
    if( scalar @fields > 4 ) {
        $fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
        push @data, [ @fields ];
    }
}
foreach my $i ( @data ) {
    print join("\t", @$i), "\n";
}
my @sorted = sort { $b->[4] <=> $a->[4] } @data;
foreach my $i ( @sorted ) {
    $i->[4] .= '%';
    print join("\t", @$i), "\n";
}
close($fh);
Let's make a few things clear:
If using the $ notation, it is customary to name file variables in lower case, such as $fh. It is also typical to name a file descriptor "fd".
You define but do not use the first three variables. If you don't apply chomp to them, the trailing newline stays attached to them. I have not done it, as they are not used.
You are defining a list with a line in each element. But then you need a list ref inside to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to the list @data using push, turning the list to add into a list ref with [ @list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.
FILE:
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
code:
use IPC::System::Simple qw( capture capturex );
use POSIX;
my $tb1_file = '/var/egridmanage_pl/daily_pl/egrid-csv/test.csv';
open my $fh1, '<', $tb1_file or die qq{Unable to open "$tb1_file" for input: $!};
my @t1_temp_12 = map {
    chomp;
    my @t1_ft_12 = split /,/;
    sprintf "%.0f", $t1_ft_12[6] if $t1_ft_12[2] eq '12:00:00';
} <$fh1>;
print "TEMP @t1_temp_12\n";
my $result = @t1_temp_12 - 273.14;
print "$result should equal something closer to 24 ";
$result value prints out -265.14, making me think the @t1_temp_12 is hashed
So I tried to do awk
my $12temp = capture("awk -F"," '$3 == "12:00:00" {print $7 - 273-.15}' test.csv");
I've tried using ``, qx, open, system all having the same error result using the awk command
But this errors out. When executing awk at command line i get the favoured results.
This looks like there's some cargo cult programming going on here. It looks like all you're trying to do is find the line for 12:00:00 and print the temperature in degrees C rather than K.
Which can be done like this:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    my @fields = split /,/;
    print $fields[6] - 273.15 if $fields[2] eq "12:00:00";
}
__DATA__
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
Prints:
25.72
You don't really need to do map sprintf etc. (Although you could do a printf on that output if you do want to format it).
Edit: From the comments, it seems one of the sources of confusion is extracting an element from an array. An array is zero or more scalar elements - you can't just assign one to the other, because .... well, what should happen if there isn't just one element (which is the usual case).
Given an array, we can:
pop @array will return the last element (and remove it from the array) so you could my $result = pop @array;
[0] is the first element of the array, so we can my $result = $array[0];
Or we can assign one array to another: my ( $result ) = @array; - because on the left hand side we have an array now, and it's a single element - the first element of @array goes into $result. (The rest isn't used in this scenario - but you could do my ( $result, @anything_else ) = @array;)
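The options above can be tried side by side. This is a minimal sketch with made-up sample values; it also shows what an array yields in scalar context, which is what happens when an array is used directly in a subtraction.

```perl
use strict;
use warnings;

my @array = ( 298.87, 297.5, 296.25 );    # made-up sample values

my $count           = @array;        # scalar context: the number of elements
my $first           = $array[0];     # index 0 is the first element
my ($also_first)    = @array;        # list assignment takes the first element only
my ( $head, @rest ) = @array;        # first element, then everything else
my $last            = pop @array;    # removes and returns the last element

print "count=$count first=$first also_first=$also_first\n";
print "head=$head rest=@rest last=$last remaining=@array\n";
```

Since $count is the element count, an expression such as @array - 273.14 subtracts from that count and not from any temperature stored in the array.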
So in your example - if what you're trying to do is retrieve a value matching a criteria - the normal tool for the job would be grep - which filters an array by applying a conditional test to each element.
So:
my @lines = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print "@lines";
print $lines[0];
Which we can reduce to:
my ( $firstresult ) = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $firstresult;
But as we want to transform our array - map is the tool for the job.
my ( $result ) = map { (split /,/)[6] - 273.15 } grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $result;
First we:
use grep to extract the matching elements. (one in this case, but doesn't necessarily have to be!)
use map to transform the list, so that we turn each element into just its 7th field (index 6), and subtract 273.15
assign the whole lot to a list containing a single element - in effect just taking the first result, and throwing the rest away.
But personally, I think that's getting a bit complicated and may be hard to understand - and would suggest instead:
my $result;
while (<DATA>) {
    my @fields = split /,/;
    if ( $fields[2] eq "12:00:00" ) {
        $result = $fields[6] - 273.15;
        last;
    }
}
print $result;
Iterate your data, split - and test - each line, and when you find one that matches the criteria - set $result and bail out of the loop.
@t1_temp_12 is an array. Why are you trying to subtract a single value from it?
my $result = @t1_temp_12 - 273.14;
Did you want to do this instead?
@t1_temp_12 = map {$_ - 273.14} @t1_temp_12;
As a shell one-liner, you could write your entire script as:
perl -F, -lanE 'say $F[6]-273.14 if $F[2] eq "12:00:00"' <<DATA
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
DATA
25.73
I have an input file like so, separated by newline characters.
AAA
BBB
BBA
What would be the most efficient way to count the columns (vertically), first with first, second with second etc etc.
Sample OUTPUT:
ABB
ABB
ABA
I have been using the following, but am unable to figure out how to remove the scalar context from it. Any hints are appreciated:
while (<@seq_prot>){
    chomp;
    my @sequence = map substr (@seq_prot, 1, 1), $start .. $end;
    @sequence = split;
}
My idea was to use the substring to get the first letter of the input (A in this case), and it would cycle for all the other letters (The second A and B). Then I would increment the cycle number + 1 so as to get the next line, until I reached the end. Of course I can't seem to get the first part going, so any help is greatly appreciated, am stumped on this one.
Basically, you're trying to transpose an array.
This can be done easily using Array::Transpose
use warnings;
use strict;
use Array::Transpose;
die "Usage: $0 filename\n" if @ARGV != 1;
for (transpose([map {chomp; [split //]} <>])) {
    print join("", map {$_ // " "} @$_), "\n"
}
For an input file:
ABCDEFGHIJKLMNOPQRS
12345678901234
abcdefghijklmnopq
ZYX
Will output:
A1aZ
B2bY
C3cX
D4d
E5e
F6f
G7g
H8h
I9i
J0j
K1k
L2l
M3m
N4n
O o
P p
Q q
R
S
You'll have to read in the file once for each column, or store the information and go through the data structure later.
I was originally thinking in terms of arrays of arrays, but I don't want to get into References.
I'm going to make the assumption that each line is the same length. Makes it simpler that way. We can use split to split your line into individual letters:
my $line = "ABC";
my @split_line = split //, $line;
This will give us:
$split_line[0] = "A";
$split_line[1] = "B";
$split_line[2] = "C";
What if we now took each letter, and placed it into a @vertical_array?
my @vertical_array;
for my $index ( 0..$#split_line ) {
    $vertical_array[$index] .= $split_line[$index];
}
Now let's do this with the next line:
$line = "123";
@split_line = split //, $line;
for my $index ( 0..$#split_line ) {
    $vertical_array[$index] .= $split_line[$index];
}
This will give us:
$vertical_array[0] = "A1";
$vertical_array[1] = "B2";
$vertical_array[2] = "C3";
As you can see, I'm building @vertical_array with each iteration:
use strict;
use warnings;
use autodie;
use feature qw(say);
my @vertical_array;
while ( my $line = <DATA> ) {
    chomp $line;
    my @split_line = split //, $line;
    for my $index ( 0..$#split_line ) {
        $vertical_array[$index] .= $split_line[$index];
    }
}
#
# Print out your vertical lines
#
for my $line ( @vertical_array ) {
    say $line;
}
__DATA__
ABC
123
XYZ
BOY
FOO
BAR
This prints out:
A1XBFB
B2YOOA
C3ZYOR
If I had used references, I could probably have built an array of arrays and then flipped it. That's probably more efficient, but more complex. However, that may be better at handling lines of different lengths.
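For comparison, the array-of-arrays variant mentioned above could be sketched as follows. This is not the code from the answer: transpose_lines is a hypothetical name, and padding short lines with a single space is an assumption about how ragged input should be handled.

```perl
use strict;
use warnings;

# Transpose a list of strings: column N of the input becomes row N of the
# output. Lines that are too short contribute a space in the missing columns.
sub transpose_lines {
    # One inner array of characters per input line.
    my @rows = map { [ split //, $_ ] } @_;

    # The widest line decides how many output rows there are.
    my $width = 0;
    for my $row (@rows) {
        $width = @$row if @$row > $width;
    }

    my @out;
    for my $col ( 0 .. $width - 1 ) {
        push @out, join '', map { $_->[$col] // ' ' } @rows;
    }
    return @out;
}

print "$_\n" for transpose_lines(qw(ABC 12 XYZ));
```

Unlike the string-appending loop, this keeps the columns aligned when a line in the middle is shorter than the others, at the cost of holding every character in memory as a separate array element.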
So, i have a file to read like this
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
What I want is if $x matches with Some.Text for example, how can I get a variable with "Some big text with spaces and numbers and something" or if it matches with "Some.Text2" to get "Again some big test, etc".
open FILE, "<cats.txt" or die $!;
while (<FILE>) {
    chomp;
    my @values = split('~~~', $_);
    foreach my $val (@values) {
        print "$val\n" if ($val eq $x)
    }
    exit 0;
}
close FILE;
And from now on I don't know what to do. I just managed to print "Some.text" if it matches with my variable.
splice can be used to remove elements from @values in pairs:
while(my ($matcher, $printer) = splice(@values, 0, 2)) {
    print $printer if $matcher eq $x;
}
Alternatively, if you need to leave @values intact you can use a C-style loop:
for (my $i=0; $i<@values; $i+=2) {
    print $values[$i+1] if $values[$i] eq $x;
}
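If several lookups are needed, the key/value pairs can also be loaded into a hash once. The sketch below assumes that keys and values strictly alternate and that a trailing item without a partner can be dropped; parse_pairs is a hypothetical helper name.

```perl
use strict;
use warnings;

# Turn "key~~~value~~~key~~~value..." into a hash reference.
sub parse_pairs {
    my ($line) = @_;
    my @values = split /~~~/, $line;
    pop @values if @values % 2;    # odd count: drop the unpaired last item
    my %lookup = @values;          # consecutive items become key => value
    return \%lookup;
}

my $line   = 'Some.Text~~~Some big text~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on';
my $lookup = parse_pairs($line);

my $x = 'Some.Text2';
print "$lookup->{$x}\n" if exists $lookup->{$x};
```

If the same key appears twice on a line, the later value replaces the earlier one, which the splice and C-style loops above do not do.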
Your best option is perhaps not to split, but to use a regex, like this:
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
    while (/Some.Text2?~~~(.+?)~~~/g) {
        say $1;
    }
}
__DATA__
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
Output:
Some big text with spaces and numbers and something
Again some big test, etc