How can I organize this data using Perl? - perl

I am new to Perl. I have an input file such as:
a 7 5
b 8 2
a 3 2
b 4 1
c 6 1
How can I get output like
column_1_val, number_occurrence_column_1, sum_of_column_2, sum_of_column_3
For example
a 2 10 7
b 2 12 3
c 1 6 1

The program below is a possible solution. I have used the DATA file handle whereas you will presumably need to open an external file and use the handle from that.
use strict;
use warnings;
use feature 'say';
my %data;
while (<DATA>) {
my ($key, @vals) = split;
$data{$key}[0]++;
my $i;
$data{$key}[++$i] += $_ for @vals;
}
say join ' ', $_, @{$data{$_}} for sort keys %data;
__DATA__
a 7 5
b 8 2
a 3 2
b 4 1
c 6 1
output
a 2 10 7
b 2 12 3
c 1 6 1
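If the data lives in a file rather than after __DATA__, the same loop works on any file handle. The following is a minimal sketch, assuming a hypothetical helper named summarize; to keep it self-contained, an in-memory file handle stands in for the real input file.

```perl
use strict;
use warnings;

# summarize() is a hypothetical helper, not part of the answer above.
# It reads whitespace-separated lines from any file handle and returns
# one "key count sum2 sum3 ..." string per distinct first column.
sub summarize {
    my ($fh) = @_;
    my %data;
    while (my $line = <$fh>) {
        my ($key, @vals) = split ' ', $line;
        next unless defined $key;            # skip blank lines
        $data{$key}[0]++;                    # occurrence count
        $data{$key}[$_ + 1] += $vals[$_] for 0 .. $#vals;
    }
    return map { join ' ', $_, @{ $data{$_} } } sort keys %data;
}

# The in-memory handle stands in for something like:
#   open my $fh, '<', 'input.txt' or die "Cannot open input.txt: $!";
my $text = "a 7 5\nb 8 2\na 3 2\nb 4 1\nc 6 1\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @summary = summarize($fh);
close $fh;
print "$_\n" for @summary;
```

Passing the handle into a subroutine keeps the counting logic independent of where the data comes from.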

That would be something like (untested):
while (<>) {
if (m/(\w+)\s+(\d+)\s+(\d+)/) {
($n, $r1, $r2) = ($1, $2, $3);
$nr{$n}++;
$r1{$n} += $r1;
$r2{$n} += $r2;
}
}
for $n (sort keys %nr) {
print "$n $nr{$n} $r1{$n} $r2{$n}\n";
}
This is a very quick-and-dirty way of doing what you described, but it should get you on your way.

I don't know Perl either, but if you are only concerned with the result, here is a solution in awk. It may or may not help you, but in case you need it:
awk '{c[$1]++;a[$1]=a[$1]+$2;b[$1]+=$3}END{for(i in a)print i,c[i],a[i],b[i]}' file3

A slightly different take:
my %records;
while ( <> ) {
my @cols = split ' ';
my $rec = $records{ $cols[0] } ||= {};
$rec->{number_occurrence_column_1}++;
$rec->{sum_of_column_2} += $cols[1];
$rec->{sum_of_column_3} += $cols[2];
}
foreach my $rec ( map { +{ col1 => $_, %{ $records{ $_ } } } }
sort keys %records
) {
print join( "\t"
, @$rec{ qw<col1 number_occurrence_column_1
sum_of_column_2 sum_of_column_3
>
}
), "\n"
;
}

Related

count occurrences of every character per every line position in perl

Similar to question
unix - count occurrences of character per line/field
but for every character in every position on the line.
Given a file of ~500-characters per 1e7 lines,
I want a two dimensional summary structure like
$summary{'a','b','c','0','1','2'}[pos 0..499] = count_integer
that shows the number of times each character was used in each position of the line. Either order of dimensions is fine.
My first method did ++summary{char}[pos] while reading,
but since many lines are the same,
it was much faster to count identical lines first,
then summarize summary{char}[pos] += n at a time
Are there more idiomatic or faster ways than the following C-like 2d loop?
#!perl
my ( %summary, %counthash ); # perl 5.8.9
sub method1 {
print "method1\n";
while (<DATA>) {
my @c = split( // , $_ );
++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
} # wend
} ## end sub method1
sub method2 {
print "method2\n";
++$counthash{$_} while (<DATA>); # slurpsum the whole file
foreach my $str ( keys %counthash ) {
my $n = $counthash{$str};
my @c = split(//, $str);
$summary{ $c[$_] }[$_] += $n foreach ( 0 .. $#c );
} #rof my $str
} ## end sub method2
# MAINLINE
if (rand() > 0.5) { &method1 } else { &method2 }
print "char $_ : @{$summary{$_}} \n" foreach ( 'a', 'b' );
# both methods have this output summary
# char a : 3 3 2 2 3
# char b : 2 2 3 3 2
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa
Depending on how your data is formed method2 might be a bit faster or slower than method 1.
But a big difference would be to use unpack instead of split.
use strict;
use warnings;
my ( %summary, %counthash ); # perl 5.8.9
sub method1 {
print "method1\n";
my @l= <DATA>;
for my $t(1..1000000) {
foreach (@l) {
my @c = split( // , $_ );
++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
}
} # wend
} ## end sub method1
sub method2 {
print "method2\n";
++$counthash{$_} while (<DATA>); # slurpsum the whole file
for my $t(1..1000000) {
foreach my $str ( keys %counthash ) {
my $n = $counthash{$str};
my $i = 0;
$summary{ $_ }[$i++] += $n foreach ( unpack("c*",$str) );
}
}
} ## end sub method2
# MAINLINE
#method1();
method2();
print "char $_ : ". join (" ", @{$summary{ord($_)}}). " \n"
foreach ( 'a', 'b' );
# both methods have this output summary
# char a : 3 3 2 2 3
# char b : 2 2 3 3 2
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa
This runs faster (6 instead of 7.x seconds on my PC).
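To make the unpack approach concrete, here is a small sketch that counts the sample lines without the timing loop. The helper name count_positions is made up for illustration, and it uses the "C*" template (unsigned) rather than "c*"; both give the same codes for ASCII input. Because unpack yields character codes, the summary is keyed by ord($char) rather than by the character itself.

```perl
use strict;
use warnings;

# count_positions() is a hypothetical helper: given a list of lines, it
# returns a hash ref mapping character code => [count at pos 0, pos 1, ...].
sub count_positions {
    my @lines = @_;
    my %line_count;
    $line_count{$_}++ for @lines;            # collapse identical lines first

    my %summary;
    while (my ($str, $n) = each %line_count) {
        my $pos = 0;
        # "C*" unpacks every byte as an unsigned integer (its character code)
        $summary{$_}[$pos++] += $n for unpack 'C*', $str;
    }
    return \%summary;
}

my $summary = count_positions(qw(aaaaa bbbbb aabba bbbbb aaaaa));
for my $char ('a', 'b') {
    # positions where the character never occurred are undef, so show 0
    my @counts = map { defined $_ ? $_ : 0 } @{ $summary->{ ord($char) } };
    print "char $char : @counts\n";
}
```

Whether this beats split on real data depends on line length and on how many lines are duplicates, so it is worth timing both on a sample of the actual file.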

Perl Loop Conceptualization

I'm having some trouble trying to figure out how to attack this problem. I have a file that looks like :
1 1 1 1 1 1 2 a
1 1 1 3 4 4 4 a
1 1 1 4 4 4 2 a
2 2 2 3 3 3 2 b
2 2 2 1 1 1 1 b
Which, I want to count how many times the first 3 numbers repeat consecutively. As in I load the data set into perl, and it will spit out:
3
2
as the string '1 1 1' was found at the start of a line 3 times in a row, and the string '2 2 2' was found to begin a line 2 times in a row. Either string may appear later in the file, which complicates things. As well, I will not know what the beginning string is either, and it varies.
Honestly, I'm just really confused on how to attack this. If anyone can help conceptualize this/ give me some pseudo-code to help, that would be great.
EDIT: As per Borodins answer, which works perfectly for what I asked. However, if I wanted to print the count plus which letter it is, How could I do that? So far,
my ( $inp, $outp) = qw / OUT2 OUTFILE/;
open my $input, '<', $inp or die;
open my $output, '>', $outp or die;
my ($last_key, $count);
while ( <$input> ) {
my $key = join ' ', (split)[0..2];
my $id = join ' ', (split)[7];
if ( defined $last_key and $key eq $last_key ) {
++$count;
}
else {
printf "%s %d $id\n", $last_key, $count if defined $last_key;
$last_key = $key;
$count = 1;
}
printf "%s %d $id\n", $last_key, $count if eof;
}
Which gives :
1 1 1 3 b
2 2 2 2 b
which isn't quite what I'm aiming for.
Thanks!
EDIT2
Got what I wanted working. Oftentimes, all it takes is asking for help to figure it out yourself.
Updated code:
my ( $inp, $outp) = qw / OUT2 OUTFILE/;
open my $input, '<', $inp or die;
open my $output, '>', $outp or die;
my ($last_key, $count, $last_id);
while ( <$input> ) {
my $key = join ' ', (split)[0..2];
my $id = join ' ', (split)[7];
if ( defined $last_key and $key eq $last_key ) {
++$count;
}
else {
printf "%s %d $last_id\n", $last_key, $count if defined $last_key;
$last_key = $key;
$count = 1;
$last_id = $id;
}
printf "%s %d $id\n", $last_key, $count if eof;
}
on:
1 1 1 1 1 1 2 a
1 1 1 3 4 4 4 a
1 1 1 4 4 4 2 a
2 2 2 3 3 3 2 b
2 2 2 1 1 1 1 b
3 3 3 2 5 4 2 c
gives:
1 1 1 3 a
2 2 2 2 b
3 3 3 1 c
Thanks all!
You just have to keep track of the previous line, or at least the relevant part of it, and how many times it was seen:
use strict;
use warnings;
my $count = 0;
my $last_prefix = '';
my $last_value = '';
while (my $line = <>) {
my ($prefix, $value) = $line =~ /^(\S+\s+\S+\s+\S+).*(\S+)/ or die "malformed line $line";
if ($prefix ne $last_prefix) {
if ($count) {
print "$count $last_value\n";
}
$last_prefix = $prefix;
$last_value = $value;
$count = 0;
}
++$count;
}
if ($count) {
print "$count $last_value\n";
}
This is just a matter of forming a key from the first three fields and counting the number of times it occurs in a row, printing a line of output whenever the key changes or the end of the file is reached.
use strict;
use warnings;
my ($last_key, $count);
while ( <DATA> ) {
my $key = join ' ', (split)[0..2];
if ( defined $last_key and $key eq $last_key ) {
++$count;
}
else {
printf "%s -> %d\n", $last_key, $count if defined $last_key;
$last_key = $key;
$count = 1;
}
printf "%s -> %d\n", $last_key, $count if eof;
}
__DATA__
1 1 1 1 1 1 2 a
1 1 1 3 4 4 4 a
1 1 1 4 4 4 2 a
2 2 2 3 3 3 2 b
2 2 2 1 1 1 1 b
3 3 3 1 1 1 1 c
output
1 1 1 -> 3
2 2 2 -> 2
3 3 3 -> 1
Update
To include the final column in the output data, just change
my $key = join ' ', (split)[0..2]
to
my $key = join ' ', (split)[0..2,-1]
output
1 1 1 a -> 3
2 2 2 b -> 2
3 3 3 c -> 1
Here is one way of doing it:
# Open file and loop through lines
open (INFH, '<', "num.txt");
my $count = 0;
my $str;
my %countHash;
while(<INFH>){
# split the line using space characters to get first three numbers
my @numArray = split(' ', $_);
#Concatenating first three numbers as a string to use as key
$key = "$numArray[0]" . "$numArray[1]" . "$numArray[2]";
#If the combination exists, update the value by adding 1. Else add new
if (!exists $countHash{$key}){
$countHash{$key} = 1;
}else{
$countHash{$key} += 1;
}
}
print %countHash;
I will update if I can make it any better.
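Two details in the approach above can be tightened. Perl creates a missing hash entry on first use, so the exists check is not needed, and joining the fields with a separator keeps keys such as 1 11 1 and 11 1 1 distinct. The sketch below illustrates both points under those assumptions (count_prefixes is a hypothetical name). Like the answer above, it counts every occurrence in the input, not only consecutive runs.

```perl
use strict;
use warnings;

# count_prefixes() is a hypothetical helper: it tallies how often each
# combination of the first three fields occurs across all lines.
sub count_prefixes {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        my @fields = split ' ', $line;
        next if @fields < 3;                     # ignore short or blank lines
        $count{ join ' ', @fields[0 .. 2] }++;   # no exists() check needed
    }
    return \%count;
}

my $count = count_prefixes(
    "1 1 1 1 1 1 2 a",
    "1 1 1 3 4 4 4 a",
    "2 2 2 3 3 3 2 b",
    "1 11 1 0 0 0 0 c",
    "11 1 1 0 0 0 0 d",
);
print "$_ -> $count->{$_}\n" for sort keys %$count;
```

Without the separator, the last two sample lines would both collapse to the key 1111 and be counted together.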

Read specific column in perl

I'm new to Perl. I have the text file below, and from it I want a single Time column followed by the value columns. How can I create a text file with my desired output in Perl?
Time Value Time Value Time Value
1 0.353366497 1 0.822193251 1 0.780866396
2 0.168834182 2 0.865650713 2 0.42429447
3 0.323540698 3 0.865984245 3 0.856875894
4 0.721728497 4 0.634773162 4 0.563059042
5 0.545131335 5 0.029808531 5 0.645993399
6 0.143720835 6 0.949973296 6 0.14425803
7 0.414601876 7 0.53421424 7 0.826148814
8 0.194818367 8 0.942334356 8 0.837107013
9 0.291448263 9 0.242588271 9 0.939609775
10 0.500159997 10 0.428897293 10 0.41946448
I've tried below code:
use strict;
use warnings;
use IO::File;
my $result;
my @files = (q[1.txt],q[2.txt],q[3.txt]);
my @fhs = ();
foreach my $file (@files) {
my $fh = new IO::File $file, O_RDONLY;
push @fhs, $fh if defined $fh;
}
while(1) {
my @lines = map { $_->getline } @fhs;
last if grep { not defined $_ } @lines[0..(@fhs-1)];
my @result=join(qq[\t], map { s/[\r?\n]+/ /g; $_ } @lines ) . qq[\r\n];
open (MYFILE, '>>Result.txt');
print (MYFILE "@result");
close (MYFILE);
}
I'd go with split.
use warnings;
use strict;
open (my $f, '<', 'your-file.dat') or die;
while (my $line = <$f>) {
my @elems = split ' ', $line;
print join "\t", @elems[0,1,3,5];
print "\n";
}
This is a one-liner; no need to write a script:
$ perl -lanE '$,="\t"; say @F[0,1,3,5]' 1.txt 2.txt 3.txt
If you like, you can shorten it to:
$ perl -lanE '$,="\t"; say @F[0,1,3,5]' [123].txt
Right now, you're just concatenating the lines of the files together. If that doesn't give you the output you like, you need to chop some columns out.
Since your output looks like you have tab delimited files as input, I split the lines coming in by tabs. And since you only wanted the second column, I only take the column at the first offset from the split.
my $line_num = 0;
while(1) {
my @lines = map { $_->getline } @fhs;
last if grep { not defined $_ } @lines[0..$#fhs];
$line_num++;
my @rows = map { [ split /\t/ ] } @lines;
my $time_val = $rows[0][0];
die "Time values are not all equal on line #$line_num!"
if grep { $time_val != $_->[0] } @rows
;
my $result = join( qq[\t], $time_val, map { $_->[1] } @rows );
open (MYFILE, '>>Result.txt');
print (MYFILE "$result\n");
close (MYFILE);
}
Of course, there is no reason to do custom coding to split delimited columns:
use Text::CSV;
...
my $csv = Text::CSV->new( { sep_char => "\t" } );
while(1) {
my @rows = map { $csv->getline( $_ ) } @fhs;
last if grep { not defined $_ } @rows[0..$#fhs];
my ( $time_val, @time_vals ) = map { $_->[0] } @rows;
my @values = map { $_->[1] } @rows;
die "Time values are not all equal on line #$line_num!"
if grep { $time_val != $_ } @time_vals
;
my $result = join( qq[\t], $time_val, @values );
...
}
use strict;
use warnings;
open(FH,"<","a.txt");
print "=========== A File content =========== \n";
my $a = `cat a.txt`;
print "$a\n";
my @temp = <>;
my (@arr, @entries, @final);
foreach ( @temp ) {
@arr = split ( " ", $_ );
push @entries, @arr;
}
close FH;
my @entries1 = @entries;
for(my $i = 7; $i<=$#entries; $i=$i+2) {
push @final, $entries[$i];
}
my $size = scalar @final;
open FH1, ">", "b.txt";
print FH1 "Time \t Value\n";
for(my $i = 0; $i < $size; $i++) {
my $j = $i+1;
print FH1 "$j \t $final[$i]\n";
}
close FH1;
print "============ B file content ===============\n";
my $b = `cat b.txt`;
print "$b";
O/P:
=========== A File content ===========
Time Value Time Value Time Value
1 0.353366497 1 0.822193251 1 0.780866396
2 0.168834182 2 0.865650713 2 0.42429447
3 0.323540698 3 0.865984245 3 0.856875894
4 0.721728497 4 0.634773162 4 0.563059042
5 0.545131335 5 0.029808531 5 0.645993399
6 0.143720835 6 0.949973296 6 0.14425803
7 0.414601876 7 0.53421424 7 0.826148814
8 0.194818367 8 0.942334356 8 0.837107013
9 0.291448263 9 0.242588271 9 0.939609775
10 0.500159997 10 0.428897293 10 0.41946448
============ B file content ===============
Time Value
1 0.353366497
2 0.822193251
3 0.780866396
4 0.168834182
5 0.865650713
6 0.42429447
7 0.323540698
8 0.865984245
9 0.856875894
10 0.721728497
11 0.634773162
12 0.563059042
13 0.545131335
14 0.029808531
15 0.645993399
16 0.143720835
17 0.949973296
18 0.14425803
19 0.414601876
20 0.53421424
21 0.826148814
22 0.194818367
23 0.942334356
24 0.837107013
25 0.291448263
26 0.242588271
27 0.939609775
28 0.500159997
29 0.428897293
30 0.41946448

grep tab separated string in perl

I am trying to grep tab separated numbers (eg 1\t3) in an array something like
@data=
1 3
2 3
1 3
3 3
the idea behind the code is something like this
#!usr/bin/perl
use strict;
use warnings;
my @data = ( "1\t3", "2\t3", "1\t3", "3\t3", );
for (my $i=0;$i<4;$i++) {
for (my $j=0;$j<4_size;$j++) {
my $pattern= "$i\t$j";
my @count=grep(/$pattern/,@data);
undef $pattern;
print "$pattern\tcount\n";
}
}
hoping for output something like
1st and 2nd column: pattern
3rd column: count of total matches
1 1
1 2
1 3 2
2 1
2 3 1
3 1
3 2
3 3 1
but the output is empty for some reason.
I have only recently started learning Perl and am finding it very intriguing.
Any suggestions?
The code below:
Does not crash if input contains unexpected characters (e.g., '(')
Only counts exact matches for the sequences of digits on either side of "\t".
Matches lines that might have been read from a file or __DATA__ section without using chomp using \R.
--
#!/usr/bin/env perl
use strict;
use warnings;
my @data = ( "1\t3", "2\t3", "(\t4", "1\t3", "3\t3", "11\t3" );
for my $i (1 .. 3) {
for my $j (1 .. 3) {
my $pattern = "$i\t$j";
my $count = grep /\A\Q$pattern\E\R?\z/, @data;
print join("\t", $pattern, $count ? $count : ''), "\n";
}
}
Output:
1 1
1 2
1 3 2
2 1
2 2
2 3 1
3 1
3 2
3 3 1
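The effect of anchoring and quoting the pattern is easiest to see side by side. The following sketch uses made-up sample data (not the question's array) and compares an unanchored match with the anchored form used above:

```perl
use strict;
use warnings;

# Hypothetical sample data, including an entry with a trailing newline.
my @data    = ("1\t3", "11\t3", "1\t3\n", "(\t4");
my $pattern = "1\t3";

# Unanchored: also matches "11\t3", because "1\t3" occurs inside it.
my $loose = grep { /$pattern/ } @data;

# Anchored and quoted: only whole-string matches, optional trailing newline.
my $exact = grep { /\A\Q$pattern\E\R?\z/ } @data;

print "unanchored: $loose, anchored: $exact\n";
# prints "unanchored: 3, anchored: 2"
```

The \Q...\E quoting matters when the pattern is built from input that may contain regex metacharacters such as an unbalanced parenthesis.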
You almost got it. Here is a working version:
#!usr/bin/perl
use strict;
use warnings;
my @data = ( "1\t3", "2\t3", "1\t3", "3\t3", );
foreach my $i (1 .. 3) {
foreach my $j (1 .. 3) {
my $pattern = "$i\t$j";
my $count = grep(/$pattern/, @data);
print $pattern . ($count ? "\t$count\n" : "\n");
}
}

Perl multidimensional array question

I have a program that prints out the location of commas in a paragraph of text in the form line:position.
For example if the paragraph is
one,two,three
three and a half
four,five
six
seven,eight
The program will print
0:4
0:8
2:5
4:6
I would like to use this output to create an array where the numbers after the colon are listed across columns in the row specified by the index before the colon. The array formed by the coordinates above would be
4 8
<blank or character '.'>
5
<blank or character '.'>
6
so array[0,0] = 4, array[0,1] = 8
array[1,0] = empty
array[2,0] = 5
etc...
I bet this is simple but I need help to write it.
$data_file="file.out";
open(DAT, $data_file) || die("Could not open file!");
@raw_data=<DAT>;
close(DAT);
my %array;
my $line = 0;
foreach $newline(@raw_data) {
chomp;
while ( $newline=~m/(,|;|:|and)/) {
push @{ $array{$line} }, pos($newline); # autovivification
}
$line++; }
Program
#!/usr/bin/env perl
use strict;
use warnings;
my %array;
my $line = 0;
while (<DATA>) {
chomp;
while ( /(,|;|:|(?:and))/g ) {
my $position = pos() - length($1) + 1;
push @{ $array{$line} }, $position; # autovivification
}
$line++;
}
for ( sort { $a <=> $b } keys %array ) {
my $flattened_value = join ', ', @{ $array{$_} };
print "$_ -> $flattened_value\n";
}
__DATA__
one,two,three
three and a half
four,five
six
seven,eight
Output
0 -> 4, 8
1 -> 7
2 -> 5
4 -> 6
Refer: chomp, join, keys, sort, split.
Refer the following documents to get an understanding of Perl's data structures and especially autovivification which has been used in this example.
perldoc perlref
perldoc perlreftut
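As a small illustration of autovivification, the sketch below builds the two-dimensional structure described in the question directly from the line:position pairs. It uses an array of array references instead of a hash, which is one possible choice rather than the approach of the program above; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Positions as printed by the program in the question ("line:position").
my @pairs = ('0:4', '0:8', '2:5', '4:6');

my @rows;    # array of array refs, indexed by line number
for my $pair (@pairs) {
    my ($line, $col) = split /:/, $pair;
    push @{ $rows[$line] }, $col;    # autovivifies $rows[$line] as []
}

# Rows that never received a value are undef; show them as '.'
my @printed = map { $_ ? join(' ', @$_) : '.' } @rows;
print "$_\n" for @printed;
```

After the loop, $rows[0][0] is 4, $rows[0][1] is 8, and $rows[1] is undefined, which matches the layout requested in the question.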