I have a CSV file with the following sample data:
o = option (alphabetical)
v = value (numerical)
number1,o1,v1,o2,v2,o3,v3,o4,v4,o5,v5,o6,v6
number2,o1,v11,o2,v22,o3,v33,o44,v44,o5,v55,o6,v66
and so on....
Required output:
NUM,o1,o2,o3,o4,o44,o5,o6
number1,v1,v2,v3,v4,,v5,v6
number2,v11,v22,v33,,v44,v55,v66
and so on...
In this data the options are the same (o1, o2, etc.) throughout the file, except that the option in the fourth position varies: it can be o4, o44, and so on. In total there are about 9 different option names in that fourth field. Can anyone please help me with the Perl code to get the required output?
I have written the code below, but I am still not getting the required output.
my @values;
my @options;
my %hash;
while (<STDIN>) {
    chomp;
    my ($srn,$o1,$v1,$o2,$v2,$o3,$v3,$o4,$v4,$o5,$v5,$o6,$v6) = split /[,\n]/, $_;
    push @values, [$srn,$v1,$v2,$v3,$v4,$v5,$v6];
    push @options, $o1,$o2,$o3,$o4,$o5,$o6;
}
# printing the header values
my @out = grep(!$hash{$_}++, @options);
print 'ID,', join(',', sort @out), "\n";
# printing the values
for my $i ( 0 .. $#values ) {
    print join(',', @{$values[$i]}), "\n";
}
Output:
ID,o1,o2,o3,o4,o44,o5,o6
number1,v1,v2,v3,v4,v5,v6
number2,v1,v2,v3,v44,v5,v6
As you can see from the above output, when the value v44 comes it lands under option o4, and the remaining values shift to the left. The values are not mapped to their options. Please suggest.
If you want to line the values up in columns based on the preceding option names, store your data rows as hashes, using the options as the keys.
use strict;
use warnings;
my (@data, %all_opts);

while (<DATA>) {
    chomp;
    my %h = ('NUM', split /,/, $_);
    push @data, \%h;
    @all_opts{keys %h} = 1;
}

my @header = sort keys %all_opts;
print join(",", @header), "\n";

for my $d (@data) {
    my @vals = map { defined $d->{$_} ? $d->{$_} : '' } @header;
    print join(",", @vals), "\n";
}
__DATA__
number1,o1,v1,o2,v2,o3,v3,o4,v4,o5,v5,o6,v6
number2,o1,v11,o2,v22,o3,v33,o44,v44,o5,v55,o6,v66
Is this what you're after?
use strict;
use warnings;
use 5.010;
my %header;
my @store;

while (<DATA>) {
    chomp;
    my ($srn, %f) = split /,/;
    @header{ keys %f } = 1;
    push @store, [ $srn, { %f } ];
}

# header
my @cols = sort keys %header;
say join q{,} => 'NUM', @cols;

# rows
for my $row (@store) {
    say join q{,} => $row->[0],
        map { $row->[1]->{ $_ } || q{} } @cols;
}
__DATA__
number1,o1,v1,o2,v2,o3,v3,o4,v4,o5,v5,o6,v6
number2,o1,v11,o2,v22,o3,v33,o44,v44,o5,v55,o6,v66
Which outputs:
NUM,o1,o2,o3,o4,o44,o5,o6
number1,v1,v2,v3,v4,,v5,v6
number2,v11,v22,v33,,v44,v55,v66
Make one pass through the file identifying all the different option names, and build a list of them.
Then make a second pass through the file:
for each record
initialise a hash from your list of option names
parse the record, assigning values for the options it contains
iterate over your list of option names, printing the corresponding values from the hash
A minimal sketch of this two-pass approach follows this list.
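Here is one way that sketch might look in Perl, assuming the record layout from the question and taking the file name as a command-line argument:
use strict;
use warnings;

my $file = shift or die "usage: $0 file.csv\n";
open my $in, '<', $file or die "$file: $!";

# pass 1: collect every option name that occurs anywhere in the file
my %seen;
while (<$in>) {
    chomp;
    my (undef, %pairs) = split /,/;    # discard the leading numberN field
    $seen{$_} = 1 for keys %pairs;
}
my @opts = sort keys %seen;
print join(',', 'NUM', @opts), "\n";

# pass 2: print each record's values under the matching option columns
seek $in, 0, 0;
while (<$in>) {
    chomp;
    my ($num, %pairs) = split /,/;
    print join(',', $num, map { defined $pairs{$_} ? $pairs{$_} : '' } @opts), "\n";
}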
You might look at the CPAN module DBD::AnyData, one of the neater modules out there. It lets you manipulate a CSV file as if it were a database, and much more.
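For a taste of it, here is a sketch along the lines of the module's synopsis. I am writing the ad_catalog registration from memory, so treat it as an assumption and check the documentation; the table and file names are made up:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:AnyData:', undef, undef, { RaiseError => 1 });

# register the CSV file as a table named 'records' (names are hypothetical)
$dbh->func( 'records', 'CSV', 'data.csv', 'ad_catalog' );

my $sth = $dbh->prepare('SELECT * FROM records');
$sth->execute;
while ( my @row = $sth->fetchrow_array ) {
    print join(',', @row), "\n";
}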
Related
I have lots of data dumps, a pretty huge amount of data, structured as follows:
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Which I would like to transform to something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
I mean:
Generate a collection of all the keys
Generate a header line with all the Keys
Map all the values to their correct "columns" (notice that in this example I have no "Key4", and Key3/Key5 are interchanged)
Possibly in Perl, since it would be easier to use in various environments.
But I am not sure if this format is unusual, or if there is a tool that already does this.
This is fairly easy using hashes and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;
my @rows;
my %headers;

{
    local $/ = "";
    while (<DATA>) {
        chomp;
        my %record;
        for my $line (split(/\n/)) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1} = $2;
            $headers{$1} = $1;
        }
        push(@rows, \%record);
    }
}

unshift(@rows, \%headers);

my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));
for my $row_ref (@rows) {
    $csv->print_hr(*STDOUT, $row_ref);
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
If your CSV format is 'complicated' - e.g. it contains commas, etc. - then use one of the Text::CSV modules. But if it isn't - and this is often the case - I tend to just work with split and join.
What's useful in your scenario is that you can map the key-value pairs within a record quite easily using a regex, then use a hash slice for output:
#!/usr/bin/env perl
use strict;
use warnings;
# set paragraph mode - records are blank-line separated
local $/ = "";

my @rows;
my %seen_header;

# read STDIN or files on command line, just like sed/grep
while ( <> ) {
    # multi-line pattern that matches all the key-value pairs,
    # and then inserts them into a hash
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    push @rows, \%this_row;

    # add the keys we've seen to a hash, so we 'know' what we've seen
    $seen_header{$_}++ for keys %this_row;
}

# extract the keys, make them unique and ordered
# (could set this by hand if you prefer)
my @header = sort keys %seen_header;

# print the header row
print join ",", @header, "\n";

# iterate the rows
foreach my $row ( @rows ) {
    # use a hash slice to select the values matching @header;
    # the map is so any undefined values (missing keys) don't report errors,
    # they just return blank fields
    print join ",", map { $_ // '' } @{$row}{@header}, "\n";
}
This, for your sample input, produces:
Key1,Key2,Key3,Key5,
Value,Other value,Maybe another value yet,,
Different value,,Invaluable,Has no value at all,
If you want to be really clever, then most of that initial loop can be done with:
my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;
The problem then is that you still need to build up the 'headers' hash, and that means something a bit more complicated:
$seen_header{$_}++ for map { keys %$_ } @rows;
It works, but I don't think it's as clear about what's happening.
However, the core of your problem may be the file size. That's where you have a bit of a problem, because you need to read the file twice: the first time to figure out which headings exist throughout the file, and the second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;
open ( my $input, '<', 'your_file.txt' ) or die $!;

local $/ = "";

my %seen_header;
while ( <$input> ) {
    $seen_header{$_}++ for m/^(\w+):/gm;
}
my @header = sort keys %seen_header;

# return to the start of file:
seek( $input, 0, 0 );

while ( <$input> ) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join ",", map { $_ // '' } @this_row{@header}, "\n";
}
This will be slightly slower, as it has to read the file twice, but its memory footprint is much smaller because it isn't holding the whole file in memory.
Unless you know all your keys in advance, and you can just define them, you'll have to read the file twice.
This seems to work with the data you've given
use strict;
use warnings 'all';
my %data;
while ( <> ) {
    next unless /^(\w+):\W*(.*\S)/;
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;
output
{
Key1 => ["Value", "Different value"],
Key2 => ["Other value"],
Key3 => ["Maybe another value yet", "Invaluable"],
Key5 => ["Has no value at all"],
}
I have a big tab-separated file with duplicate products but with different colours and amounts. I’m trying to merge the data based on the key so that I end up with one product and the combined colours and amounts separated by a delimiter (comma in this case).
I'm using the Text::CSV module so that I have better control, and because it allows me to output the file with a different delimiter (changing semicolon to pipe).
My question is, how do I merge the data properly? I don't want it simply to combine colours and amounts but remove duplicate values as well. So I was thinking a key/value with the Id/Amount and Id/Colour. But Id isn't unique so how do I do this? Do I create an array or use hashes?
Here is some sample source data, with the tab separators replaced by semicolons ;. Note that the marked row has no Colour so the empty value is not combined in the result.
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla;
101;Fruits;50020;Strawberry;500;Red;1;
101;Fruits;50020;Strawberry;1000;Red;1;
201;Vegetables;60090;Tomato;50;Green;1;
201;Vegetables;60080;Onion;1;Purple;1;
201;Vegetables;60090;Tomato;100;Red;1;
201;Vegetables;60010;Carrot;100;Purple;1;
201;Vegetables;60050;Broccoli;500;Green;1;
201;Vegetables;60050;Broccoli;1000;Green;1;
201;Vegetables;60090;Tomato;500;Yellow;1;
101;Fruits;50060;Apple;500;Green;1;
101;Fruits;50010;Grape;500;Red;1;
201;Vegetables;60010;Carrot;500;White;1;
201;Vegetables;60050;Broccoli;2000;Green;1;
201;Vegetables;60090;Tomato;1000;Red;1;
101;Fruits;50020;Strawberry;100;Red;1;
101;Fruits;50060;Apple;1000;Red;1;
201;Vegetables;60010;Carrot;250;Yellow;1;
101;Fruits;50010;Grape;100;White;1;
101;Fruits;50030;Banana;500;Yellow;1;
201;Vegetables;60010;Carrot;1000;Yellow;1;
101;Fruits;50030;Banana;1000;Green;1;
101;Fruits;50020;Strawberry;200;Red;1;
101;Fruits;50010;Grape;200;White;1;
201;Vegetables;60010;Carrot;50;Orange;1;
201;Vegetables;60080;Onion;2;White;1;
And the desired result I'm trying to get:
101;Fruits;50010;Grape;100,500,200;Red,White;1;
201;Vegetables;60090;Tomato;50,500,1000,10;Yellow,Green,Red;1;
101;Fruits;50060;Apple;500,1000;Red,Green;1;
201;Vegetables;60010;Carrot;250,50,500,1000,100;Orange,Yellow,White,Purple;1;
201;Vegetables;60050;Broccoli;1000,500,2000;Green;1;
101;Fruits;50020;Strawberry;100,1000,200,500;Red;1;
101;Fruits;50030;Banana;500,1000;Yellow,Green;1;
201;Vegetables;60080;Onion;2,1;White,Purple;1;
This is my script so far. It's not finished (and not working) because I'm not sure how to continue. I don't think this can work right because I'm trying to use the same key for different colours.
use strict;
use warnings;
use Text::CSV;
use List::MoreUtils 'uniq';
my $inputfile = shift || die "Give input and output names!\n";
my $outputfile = shift || die "Give output name!\n";
open my $infile, '<', $inputfile or die "Sourcefile in use / not found :$!\n";
open my $outfile, '>', $outputfile or die "Outputfile in use :$!\n";
binmode($outfile, ":encoding(utf8)");
my $csv_in = Text::CSV->new({binary => 1,sep_char => ";",eol => $/});
my $csv_out = Text::CSV->new({binary => 1,sep_char => "|",always_quote => 1,eol => $/}); #,quote_null => 0 #
my %data;
while (my $elements = $csv_in->getline($infile)) {
    my $id = $elements->[2];
    push @{ $data{$id} }, $elements;
}

for my $id ( sort keys %data ) {
    my $set = $data{$id};
    my @elements = @{ $set->[0] };
    $elements[4] = join ',', uniq map { $_->[4] } @$set;
    $elements[5] = join ',', uniq map { $_->[5] } @$set;
    $csv_out->print($outfile, \@elements);
}
Edit: I'm using Data::Dumper for testing, but eventually I want the result written to a file.
Hashes deal with unique keys. As you've correctly surmised, if you 'overwrite' colour then the old value is replaced.
But hashes can contain array(ref)s. So you can do:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $id = 50010;
my %hash;

$hash{$id}{'colour'} = [ "red", "green", "blue" ];
push( @{ $hash{$id}{'colour'} }, "orange" );

print Dumper \%hash;
This'll work, provided you don't have any duplicates for the colours. (e.g. there's only one line for White Grapes with that ID.).
You may have to post-process with join, to turn the array into a string.
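For example, something along these lines, assuming the structure above:
# collapse the colour arrayref into a single comma-separated field
my $colour_field = join ',', @{ $hash{$id}{'colour'} };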
Or as an alternative, you could concatenate the colours onto the existing:
if ( defined $hash->{$id}->{colour} ) {
    $hash->{$id}->{colour} .= ",$colour";
}
else {
    $hash->{$id}->{colour} = $colour;
}
I would also note that I'm unclear what you're doing with $elements->[10], because there aren't 10 columns. And I would strongly suggest not using generic names for variables, like %hash, because it's just a bad habit to get into. Vague variable names are bad style, and whilst that's largely academic when you're looking at a small chunk of code, it pays to get into the habit of making it clear what a particular variable can be expected to contain (especially when the data type isn't obvious).
I don't have time to write a proper commentary, but this program seems to do what you need. It uses the uniq function from the List::MoreUtils module, which isn't a core module and so may need installing. I trust that it's not important what order the Amounts and Colours appear in within the combined fields?
use strict;
use warnings;
use List::MoreUtils 'uniq';
print scalar <DATA>;

my %data;
while (<DATA>) {
    chomp;
    my @fields = split /;/;
    my $id = $fields[2];
    push @{ $data{$id} }, \@fields;
}

for my $id ( sort keys %data ) {
    my $set = $data{$id};
    my @fields = @{ $set->[0] };
    $fields[4] = join ',', uniq map { $_->[4] } @$set;
    $fields[5] = join ',', uniq map { $_->[5] } @$set;
    print join(';', @fields, ''), "\n";
}
__DATA__
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla;
101;Fruits;50020;Strawberry;500;Red;1;
101;Fruits;50020;Strawberry;1000;Red;1;
201;Vegetables;60090;Tomato;50;Green;1;
201;Vegetables;60080;Onion;1;Purple;1;
201;Vegetables;60090;Tomato;100;Red;1;
201;Vegetables;60010;Carrot;100;Purple;1;
201;Vegetables;60050;Broccoli;500;Green;1;
201;Vegetables;60050;Broccoli;1000;Green;1;
201;Vegetables;60090;Tomato;500;Yellow;1;
101;Fruits;50060;Apple;500;Green;1;
101;Fruits;50010;Grape;500;Red;1;
201;Vegetables;60010;Carrot;500;White;1;
201;Vegetables;60050;Broccoli;2000;Green;1;
201;Vegetables;60090;Tomato;1000;Red;1;
101;Fruits;50020;Strawberry;100;Red;1;
101;Fruits;50060;Apple;1000;Red;1;
201;Vegetables;60010;Carrot;250;Yellow;1;
101;Fruits;50010;Grape;100;White;1;
101;Fruits;50030;Banana;500;Yellow;1;
201;Vegetables;60010;Carrot;1000;Yellow;1;
101;Fruits;50030;Banana;1000;Green;1;
101;Fruits;50020;Strawberry;200;Red;1;
101;Fruits;50010;Grape;200;White;1;
201;Vegetables;60010;Carrot;50;Orange;1;
201;Vegetables;60080;Onion;2;White;1;
output
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla;
101;Fruits;50010;Grape;500,100,200;Red,White;1;
101;Fruits;50020;Strawberry;500,1000,100,200;Red;1;
101;Fruits;50030;Banana;500,1000;Yellow,Green;1;
101;Fruits;50060;Apple;500,1000;Green,Red;1;
201;Vegetables;60010;Carrot;100,500,250,1000,50;Purple,White,Yellow,Orange;1;
201;Vegetables;60050;Broccoli;500,1000,2000;Green;1;
201;Vegetables;60080;Onion;1,2;Purple,White;1;
201;Vegetables;60090;Tomato;50,100,500,1000;Green,Red,Yellow;1;
If I have a colon-delimited file named FILE and I do:
cat FILE | perl -F: -lane 'my %hash = (); $hash{@F[0]} = @F[2]'
to assign the first and third tokens as the key => value pairs for the hash...
1) Is that a sane way to assign key value pairs to a hash?
2) What is the simplest way to now find all keys with shared values and list them?
Assume FILE looks like:
Mike:34:Apple:Male
Don:23:Corn:Male
Jared:12:Apple:Male
Beth:56:Maize:Female
Sam:34:Apple:Male
David:34:Apple:Male
Desired Output: Keys with value "Apple": Mike,Jared,David,Sam
Your example won't work as you want because the -n option puts a while loop around your one-line program, so the hash you declare is created and destroyed for every record in the file. You could get around that by not declaring the hash, making it a persistent package variable which will retain all the values stored in it.
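For illustration, this is roughly the implicit loop that -lan builds around the program body (close to what B::Deparse reports), which is why a my declaration inside it is re-created for every line:
LINE: while ( defined( $_ = <ARGV> ) ) {
    chomp $_;                  # added by -l
    our @F = split /:/, $_;    # added by -a with -F:
    # ... your one-line program body runs here, once per input line ...
}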
You can then write push @{ $hash{$F[2]} }, $F[0] but notice that it should be $F[0] etc. and not @F[0]. I have used push to build a list of column 1 values for each column 3 value, instead of a one-to-one mapping relating each column 1 value to its column 3 value.
To clarify, your method produces a hash looking like this, which has to be searched to produce the display that you want.
(
Beth => "Maize",
David => "Apple",
Don => "Corn",
Jared => "Apple",
Mike => "Apple",
Sam => "Apple",
)
while mine creates this, which as you can see is pretty much already in the form you want.
(
Apple => ["Mike", "Jared", "Sam", "David"],
Corn => ["Don"],
Maize => ["Beth"],
)
But I think this problem is a bit too big to be solved with a one-line Perl program. The solution below expects the path to the input file as a command-line parameter, like this
> perl prog.pl colons.csv
but it will default to myfile.csv if no file is specified.
use strict;
use warnings;

our @ARGV = 'myfile.csv' unless @ARGV;

my %data;

while (<>) {
    my @fields = split /:/;
    push @{ $data{$fields[2]} }, $fields[0];
}

while (my ($k, $v) = each %data) {
    next unless @$v > 1;
    printf qq{Keys with value "%s": %s\n}, $k, join ', ', @$v;
}
output
Keys with value "Apple": Mike, Jared, Sam, David
use strict;
use warnings;

open my $in, '<', 'in.txt' or die $!;

my %data;
while (<$in>) {
    chomp;
    my @split = split /:/;
    $data{$split[0]} = $split[2];
}

my $query = 'Apple';
print "Keys with value $query = ";
foreach my $name (keys %data) {
    print "$name " if $data{$name} eq $query;
}
print "\n";
Arrays are used to hold list of values, so use an array.
perl -F: -lane'
    push @{ $h{$F[2]} }, $F[0];
    END {
        for my $fruit (keys %h) {
            next if @{ $h{$fruit} } < 2;
            print "$fruit: ", join(",", @{ $h{$fruit} });
        }
    }
' FILE
The END block is executed on exit. In it, we iterate over the keys of the hash. If the value of the current hash element is an array with only one element, it is skipped; otherwise, we print the key followed by the contents of the array referenced by the hash element.
Here is another way:
perl -F: -lane'
    push @{ $h{$F[2]} }, $F[0];
    }{
    print "$_: ", join(",", @{ $h{$_} }) for grep { @{ $h{$_} } > 1 } keys %h;
' file
We read each line and build a hash of arrays, using the third column as the key and the first column as the list of values for the matching key. In the END block (written }{ here) we use grep to keep only the keys whose array has more than one element, and print each such key followed by its array elements.
It doesn't have to be a one-liner,
Good. It's not going to be...
Is that a sane way to assign key value pairs to a hash?
You're simply assigning the key value pairs as:
$hash{"key"} = "value";
Which is about as simple as it gets. It can also be done via map, as sketched below. However, the main issue I see is what should happen if you have duplicate keys.
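For instance, a sketch of the map version, assuming $fh is already open on your file:
# (split /:/)[0, 2] picks the name and category columns from each line,
# and map flattens the resulting (key, value) pairs straight into the hash
my %hash = map { (split /:/)[0, 2] } <$fh>;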
Let's say your file looks like this:
Mike:34:Apple:Male
Don:23:Corn:Male
Jared:12:Apple:Male
Beth:56:Maize:Female
Sam:34:Apple:Male
David:34:Apple:Male # Note this entry is here twice!
David:35:Wheat:Male # Note this entry is here twice!
Let's do a simple assignment loop:
my %hash;
while ( my $line = <$fh> ) {
    chomp $line;
    my ($name, $age, $category, $sex) = split /:/, $line;
    $hash{$name} = $category;
}
When you get to $hash{David}, it will first be set to Apple, but then you change the value to Wheat. There are four ways you can handle this:
Use whatever the last value is. No change in the loop.
Use the first value and ignore subsequent values. Simple enough to do (see the sketch after this list).
If that happens, it's an error. Abort the program and report the error (also sketched below).
Keep all values.
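Options 2 and 3 are one-line guards inside the loop, for example:
# option 2: keep the first value seen, ignore later ones
$hash{$name} = $category unless exists $hash{$name};

# option 3: treat a duplicate key as a fatal error
die "Duplicate entry for '$name' at line $.\n" if exists $hash{$name};
$hash{$name} = $category;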
This last one is the most interesting because it involves a reference to an array as the values for your hash:
my %hash;
while ( my $line = <$fh> ) {
    chomp $line;
    my ($name, $age, $category, $sex) = split /:/, $line;
    $hash{$name} = [] if not exists $hash{$name};  # make the value an array reference
    push @{ $hash{$name} }, $category;
}
Now, each value in my hash is a reference to an array:
my @values = @{ $hash{David} };   # the values for David...
print "David is in categories " . join( ", ", @values ) . "\n";
This will print out David is in categories Apple, Wheat
What is the simplest way to now find all keys with shared values and list them?
The easiest way is to create a second hash that's keyed by your value. In this hash, you will need to use an array reference. Let's assume no duplicate names for now:
my %hash;
my %indexed_hash;
while ( my $line = <$fh> ) {
    chomp $line;
    my ($name, $age, $category, $sex) = split /:/, $line;
    $hash{$name} = $category;
    $indexed_hash{$category} = [] if not exists $indexed_hash{$category};
    push @{ $indexed_hash{$category} }, $name;
}
Now, if I want to find all the duplicates of Apple:
my @names = @{ $indexed_hash{Apple} };
print "The following are in 'Apple': " . join( ", ", @names ) . "\n";
Since we're getting into references, we could take things a step further and store all of the values from your file in your hash. Again, for simplicity, I am assuming that you will have one and only one entry per name:
my %hash;
while ( my $line = <$fh> ) {
    chomp $line;
    my ($name, $age, $category, $sex) = split /:/, $line;
    $hash{$name}->{AGE}      = $age;
    $hash{$name}->{CATEGORY} = $category;
    $hash{$name}->{SEX}      = $sex;
}
for my $name ( sort keys %hash ) {
    print "$name Information:\n";
    print "     Age: " . $hash{$name}->{AGE} . "\n";
    printf "Category: %s\n", $hash{$name}->{CATEGORY};
    print "     Sex: @{[ $hash{$name}->{SEX} ]}\n\n";
}
The last two statements show easier ways of interpolating complex data structures into a string. The printf is fairly clear. The @{[...]} in the second is a neat little trick.
What have you tried?
If you reverse the hash into a list of value => key pairs and then run List::Util's pairs() over that list, you can transform the hash into a hash of value => key arrayrefs, i.e. ( foo => [ 'bar', 'baz' ] ). Then grep { @{ $hash{$_} } > 1 } keys %hash, and print the results.
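A sketch of that idea, assuming %hash is the name => category hash built earlier (pairs() needs List::Util 1.29 or later):
use List::Util qw( pairs );

my %by_value;
for my $pair ( pairs( reverse %hash ) ) {    # each $pair is [ value, key ]
    my ($value, $key) = @$pair;
    push @{ $by_value{$value} }, $key;
}

# report only the values shared by more than one key
for my $value ( grep { @{ $by_value{$_} } > 1 } keys %by_value ) {
    print qq{Keys with value "$value": }, join(',', @{ $by_value{$value} }), "\n";
}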
What's the best way to summarize data from a file that has around 2 million records in Perl?
For example, a file like this,
ABC|XYZ|DEF|EGH|100
ABC|XYZ|DEF|FGH|200
SDF|GHT|WWW|RTY|1000
SDF|GHT|WWW|TYU|2000
needs to be summarized on the first three columns, like this:
ABC|XYZ|DEF|300
SDF|GHT|WWW|3000
Chris
Assuming there are always five columns, the fifth of which is numeric, and you always want the first three columns to be the key...
use warnings;
use strict;
my %totals_hash;

while (<>)
{
    chomp;
    my @cols = split /\|/;
    my $key = join '|', @cols[0..2];
    $totals_hash{$key} += $cols[4];
}

foreach (sort keys %totals_hash)
{
    print $_, '|', $totals_hash{$_}, "\n";
}
You can use a hash as:
my %hash;
while (<DATA>) {
    chomp;
    my @tmp = split /\|/;       # split each line on |
    my $value = pop @tmp;       # the last element is the value
    pop @tmp;                   # drop the unwanted fourth field
    my $key = join '|', @tmp;   # join the remaining elements to form the key
    $hash{$key} += $value;      # add the value for this key
}

# print hash key-values
for (sort keys %hash) {
    print $_ . '|' . $hash{$_} . "\n";
}
Presuming your input file has its records on separate lines:
perl -n -e 'chomp;@a=split/\|/;$h{join"|",splice @a,0,3}+=pop @a;END{print map{"$_: $h{$_}\n"}keys%h}' < inputfile
1-2-3-4 I declare A CODE-GOLF WAR!!! (Okay, a reasonably readable code-golf dust-up.)
my %sums;
m/([^|]+\|[^|]+\|[^|]+).*?\|(\d+)/ and $sums{ $1 } += $2 while <>;
print join( "\n", ( map { "$_|$sums{$_}" } sort keys %sums ), '' );
Sort to put all records with the same first 3 triplets next to each other. Iterate through and kick out a subtotal when a different set of triplets appears.
$prevKey="";
$subtotal=0;
open(INPUTFILE, "<$inFile");
#lines=<INPUTFILE>;
close (INPUTFILE);
open(OUTFILE, ">$outFile");
#sorted=sort(#lines);
foreach $line(#lines){
#parts=split(/\|/g, $line);
$value=pop(#parts);
$value-=0; #coerce string to a number
$key=$parts[0]."|".$parts[1]."|".$parts[2];
if($key ne $prevKey){
print OUTFILE "$prevKey|$subtotal\n";
$prevKey=$key;
$subtotal=0;
}
$subtotal+=$value;
}
close(OUTFILE);
If sorting 2 million records chokes your box, then you may have to put each record into a file based on its group, and then compute the subtotal for each file.
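A rough sketch of that fallback (the file naming is made up; one bucket file per key):
use strict;
use warnings;

my %bucket;    # group key => output filehandle
open my $in, '<', 'inputfile' or die $!;
while ( my $line = <$in> ) {
    my $key = join '|', ( split /\|/, $line )[0 .. 2];
    ( my $name = $key ) =~ s/\|/_/g;    # e.g. ABC_XYZ_DEF.part
    open $bucket{$name}, '>>', "$name.part" or die $! unless $bucket{$name};
    print { $bucket{$name} } $line;
}
close $_ for values %bucket;
# ... then sum the fifth column of each small .part file separately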
I have to parse a file so that I can import it into Excel, so I thought the best approach was to create a CSV file. In this file I have to divide the contents into different categories and represent them in different columns, so I have parsed the file into different arrays corresponding to the categories. Now I am trying to create a CSV file from these arrays (I thought of using a for loop), but the problem is that the arrays are of unequal length.
INPUT
NM_144736.3
NM_144963.1
XM_144975.2
BC144986.1
NM_144989.1
BC145001.1
XM_145018.2
NM_145015.2
XM_030711.2
AK145024.1
AK145030.1
NM_145034.1
I have used regexes to parse the data into different arrays: all the NM entries go to @array1, XM to @array2, BC to @array3, and AK to @array4.
If creating arrays is not a good idea, please let me know what is. How else can I go about generating a CSV file from the above data?
Edit:
OUTPUT
NM_144963.1,XM_144975.2,BC144986.1,AK145024.1
NM_144963.1,XM_145018.2,BC145001.1,AK145030.1
NM_144989.1,XM_030711.2
NM_145015.2
NM_145034.1
Parse and write directly to an Excel spreadsheet, without importing:
use Spreadsheet::WriteExcel;

my %hash;

# Parse the data into a hash of arrayrefs, keyed by the two-letter prefix
chomp( my @lines = <DATA> );
push @{ $hash{substr $_, 0, 2} } => $_ for @lines;

# Create spreadsheet
my $workbook  = Spreadsheet::WriteExcel->new('perl.xls');
my $worksheet = $workbook->add_worksheet;

# Loop through the hash keys
my @array = sort keys %hash;
for (0 .. @array - 1) {
    # Create a column from each arrayref
    $worksheet->write_col(0, $_, $hash{$array[$_]});
}

# Close and save spreadsheet
$workbook->close;
Using parallel arrays like that is a bad idea. In fact, whenever you find yourself using names such as @array1, @array2 etc, recognize that it is a bad idea. And, no, naming the arrays @NM, @XM etc would not have made it better.
The way I see it, you have a single column of data, and you have not specified how to split that single column into multiple columns. ... Nope, my mind-reading abilities fell short. Please post the desired output and don't leave it to our imagination to figure it out.
use strict;
use warnings;

use List::AllUtils qw( each_arrayref );

my @fields = qw( NM XM BC AK );

my %data;
while ( <DATA> ) {
    chomp;
    if ( /^([A-Z]{2})_?[0-9]+\.[0-9]$/ ) {
        push @{ $data{$1} }, $_;
    }
}

print join(',', @fields), "\n";

my $it = each_arrayref @data{ @fields };
while ( my @values = $it->() ) {
    print join(',', map { defined($_) ? $_ : '' } @values), "\n";
}
__DATA__
NM_144736.3
NM_144963.1
XM_144975.2
BC144986.1
NM_144989.1
BC145001.1
XM_145018.2
NM_145015.2
XM_030711.2
AK145024.1
AK145030.1
NM_145034.1
Output:
NM,XM,BC,AK
NM_144736.3,XM_144975.2,BC144986.1,AK145024.1
NM_144963.1,XM_145018.2,BC145001.1,AK145030.1
NM_144989.1,XM_030711.2,,
NM_145015.2,,,
NM_145034.1,,,