How to count the number of occurrences in hash values - Perl

I have the FIX file below and I want to find out how many orders were sent at the same time. I am using tag 52 as the sending time.
Below is the file:
8=FIX.4.2|9=115|35=A|52=20080624-12:43:38.021|10=186|
8=FIX.4.2|52=20080624-12:43:38.066|10=111|
8=FIX.4.2|9=105|35=1|22=BOO|52=20080624-12:43:39.066|10=028|
How can I count how many times each tag 52 value occurs?
So far I have written the code below, but it is not giving me the frequency.
#!/usr/bin/perl
$f = '2.txt';
open (F,"<$f") or die "Can not open\n";
while (<F>)
{
    chomp $_;
    @data = split (/\|/,$_);
    foreach $data (@data)
    {
        if ( $data == 52){
            @data1 = split ( /=/,$data);
            for my $j (@data1)
            {
                $hash{$j}++;
            }
            for my $j (keys %hash)
            {
                print "$j: ", $hash{j}, "\n";
            }
        }
    }
}

Here is your code corrected:
#!/usr/bin/perl
$f = '2.txt';
open (F,"<$f") or die "Can not open\n";
my %hash;
while (<F>) {
    chomp $_;
    @data = split (/\|/, $_);
    foreach $data (@data) {
        if ($data =~ /^52=(.*)/) {
            $hash{$1}++;
        }
    }
}
for my $j (keys %hash) {
    print "$j: ", $hash{$j}, "\n";
}
Explanation:
if ( $data == 52) does a numeric comparison of the entire field against 52 instead of checking whether the field starts with tag 52, so it does not select the field you want. I replaced it with a regexp match.
The same regexp also captures the timestamp immediately, without needing to split the field again: (.*) captures it, and $1 refers to it in the following increment.
It hardly makes sense to output the hash for every line of input (your code prints it inside the foreach loop), so I moved it below the file loop. But perhaps printing the running counts for every line is what you wanted; I cannot tell.
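For reference, here is a minimal standalone sketch of the same idea that skips the inner split entirely and pulls the tag 52 value straight out of each line; it assumes the pipe-delimited sample above lives in 2.txt:
#!/usr/bin/perl
use strict;
use warnings;

my %count;
open my $fh, '<', '2.txt' or die "Cannot open 2.txt: $!";
while (my $line = <$fh>) {
    # capture everything between "52=" and the next pipe
    $count{$1}++ if $line =~ /(?:^|\|)52=([^|]+)/;
}
close $fh;

for my $time (sort keys %count) {
    print "$time: $count{$time}\n";
}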

Related

Counting through a hash - PERL

I have a database of places people have ordered items from. I parsed the list to get the city and state, so it prints like this: city, state (e.g. New York, NY).
I use the variables $city and $state, but I want to count how many times each city/state combination occurs, so it looks like this: city, state, count (e.g. Seattle, WA 8).
I have all of it working except the count. I am using a hash, but I can't figure out what is wrong with it:
if ($varc==3) {
$line =~ /(?:\>)(\w+.*)(?:\<)/;
$city = $1;
}
if ($vars==5) {
$line =~ /(?:\>)((\w+.*))(?:\<)/;
$state = $1;
# foreach $count (keys %counts){
# $counts = {$city, $state} {$count}++;
# print $counts;
# }
print "$city, $state\n";
}
foreach $count (keys %counts){
$counts = {$city, $state} {$count}++;
print $counts;
}
Instead of printing city and state you can build a "location" string with both items and use the following counting code:
# Declare this variable before starting to parse the locations.
my %counts = ();
# Inside of the loop that parses the city and state, let's assume
# that you've got $city and $state already...
my $location = "$city, $state";
$counts{$location} += 1;
}
# When you've processed all locations then the counts will be correct.
foreach $location (keys %counts) {
print "OK: $location => $counts{$location}\n";
}
# OK: New York, NY => 5
# OK: Albuquerque, NM => 1
# OK: Los Angeles, CA => 2
This is going to be a mix of an answer and a code review. I will start with a warning though.
You are trying to parse what looks like XML with Regular Expressions. While this can be done, it should probably not be done. Use an existing parser instead.
How do I know? Content wrapped in angle brackets suggests the format is XML, unless you have a very weird CSV file.
# V V
$line =~ /(?:\>)(\w+.*)(?:\<)/;
Also note that you don't need to escape < and >, they have no special meaning in regex.
Now to your code.
First, make sure you always use strict and use warnings, so you are aware of stuff that goes wrong. I can tell you're not using them because the $count in your loop has no my.
What are $vars (with an s) and $varc (with a c)? I am guessing they have to do with the state and the city. Are they column numbers? In an XML file? Huh.
$line =~ /(?:\>)((\w+.*))(?:\<)/;
Why are there two capture groups, both capturing the same thing?
Anyway, you want to count how often each combination of state and city occurs.
foreach $count (keys %counts){
$counts = {$city, $state} {$count}++;
print $counts;
}
Have you run this code? Even without strict, it gives a syntax error. I'm not even sure what it's supposed to do, so I can't tell you how to fix it.
To implement counting, you need a hash. You got that part right. But you need to declare that hash variable outside of your file reading loop. Then you need to create a key for your city and state combination in the hash, and increment it every time that combination is seen.
my %counts; # declare outside the loop

while ( my $line = <$fh> ) {
    chomp $line;
    if ( $varc == 3 ) {
        $line =~ /(?:\>)(\w+.*)(?:\<)/;
        $city = $1;
    }
    if ( $vars == 5 ) {
        $line =~ /(?:\>)((\w+.*))(?:\<)/;
        $state = $1;
        print "$city, $state\n";
        $counts{"$city, $state"}++; # increment when seen
    }
}
You have to parse the whole file before you can know how often each combination is in the file. So if you want to print those together, you will have to move the printing outside of the loop that reads the file, and iterate the %counts hash by keys at a later point.
my %counts; # declare outside the loop

while ( my $line = <$fh> ) {
    chomp $line;
    if ( $varc == 3 ) {
        $line =~ /(?:\>)(\w+.*)(?:\<)/;
        $city = $1;
    }
    if ( $vars == 5 ) {
        $line =~ /(?:\>)((\w+.*))(?:\<)/;
        $state = $1;
        $counts{"$city, $state"}++; # increment when seen
    }
}

# iterate again to print final counts
foreach my $item ( sort keys %counts ) {
    print "$item $counts{$item}\n";
}
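If the input really is XML, a sketch of the parser-based route might look like this; the file name and the <city>/<state> element names are made up here, since the actual markup isn't shown:
use strict;
use warnings;
use XML::LibXML;

my %counts;
# Hypothetical structure: <order><city>...</city><state>...</state></order>
my $doc = XML::LibXML->load_xml( location => 'orders.xml' );
for my $order ( $doc->findnodes('//order') ) {
    my $city  = $order->findvalue('./city');
    my $state = $order->findvalue('./state');
    $counts{"$city, $state"}++;
}

for my $location ( sort keys %counts ) {
    print "$location $counts{$location}\n";
}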

sort 2 key hash by value

I keep learning about hashes and the various things you can do with them.
Today I have this question: how do I sort a hash by value when it has two keys, and how do I print it out?
I have a CSV file. I'm trying to store values in the hash and sort it by value; that way I'll be able to print the biggest and the smallest value. I also need the date on which that value occurred.
So far I can print the hash, but I can't sort it.
#!/usr/bin/perl
#find openMin and openMax.
use warnings;
use strict;
my %pick;
my $key1;
my $key2;
my $value;
my $file= 'msft2.csv';
my $lines = 0;
my $date;
my $mm;
my $mOld = "";
my $open;
my $openMin;
my $openMax;
open (my $fh,'<', $file) or die "Couldnt open the $file:$!\n";
while (my $line=<$fh>)
{
my @columns = split(',',$line);
$date = $columns[0];
$open = $columns[1];
$mm = substr ($date,5,2);
if ($lines>=1) { #first line of the file holds the column names, which I
$key1 = $date;   #don't need; the data itself begins with the second line
$key2 = "open";
$value = $open;
$pick{$key1}{"open"}=$value;
}
$lines++;
}
foreach $key1 (sort keys %pick) {
foreach $key2 (keys %{$pick{$key1}}) {
$value = $pick{$key1}{$key2};
print "$key1 $key2 $value \n";
}
}
exit;
1. Use a real CSV parser
Parsing a CSV with split /,/ works fine...unless one of your fields contains a comma. If you are absolutely, positively, 100% sure that your code will never, ever have to parse a CSV with a comma in one of the fields, feel free to ignore this. If not, I'd recommend using Text::CSV. Example usage:
use Text::CSV;
my $csv = Text::CSV->new( { binary => 1 } )
or die "Cannot use CSV: " . Text::CSV->error_diag ();
open my $fh, "<", $file or die "Failed to open $file: $!";
while (my $line = $csv->getline($fh)) {
    print @$line, "\n";
}
$csv->eof or $csv->error_diag();
close $fh;
2. Sorting
I only see one secondary key in your hash: open. If you're trying to sort based on the value of open, do something like this:
my %hash = (
foo => { open => "date1" },
bar => { open => "date2" },
);
foreach my $key ( sort { $hash{$a}{open} cmp $hash{$b}{open} } keys %hash ) {
print "$key $hash{$key}{open}\n";
}
(This assumes the values you're sorting are not numeric. If they are numeric (e.g. 3, -17.57), use the spaceship operator <=> instead of the string comparison operator cmp. See perldoc -f sort for details and examples.)
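For example, a numeric version of the same sort, using made-up values, would be:
my %hash = (
    foo => { open => 35.12 },
    bar => { open => -17.57 },
);
foreach my $key ( sort { $hash{$a}{open} <=> $hash{$b}{open} } keys %hash ) {
    print "$key $hash{$key}{open}\n";    # bar first, then foo
}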
EDIT: You haven't explained what format your dates are in. If they are in YYYY-MM-DD format, sorting as above will work, but if they're in MM-DD-YYYY format, for example, 01-01-2014 would come before 12-01-2013. The easiest way to take care of this is to reorder the components of your date from most to least significant (i.e. year followed by month followed by day). You can do this using Time::Piece like this:
use Time::Piece;
my $date = "09-26-2013";
my $t = Time::Piece->strptime($date, "%m-%d-%Y");
print $t->strftime("%Y-%m-%d");
Another tidbit: in general you should only declare variables right before you use them. You gain nothing by declaring everything at the top of your program except decreased readability.
You could concatenate $key1 and $key2 into a single key:
$key = "$key1 $key2";
$pick{$key} = $value;
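If you only ever store the open price, you can even drop the constant "open" level and key %pick by date alone; then finding the smallest and biggest open values with their dates is one sort by value. A minimal sketch under that assumption:
# e.g. $pick{'2013-09-26'} = 35.12;
my @dates_by_open = sort { $pick{$a} <=> $pick{$b} } keys %pick;
my ($min_date, $max_date) = ($dates_by_open[0], $dates_by_open[-1]);
print "openMin: $pick{$min_date} on $min_date\n";
print "openMax: $pick{$max_date} on $max_date\n";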

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate, I am very new to Perl, and I apologize in advance for making silly mistakes.
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counters and can read each line of DNA sequence, but at the moment the script keeps a running total across all lines. I want it to read each sequence, print the counts for that sequence, then move on to the next one, so that each line gets its own individual base counts.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encountered and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array= (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements assess which base is present and make a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the counts that are printed out: my percentage values are wrong. I suspect the total is being calculated and then that new value is being folded back into the base counts.
Several things:
1. Use a hash instead of declaring each counter separately.
2. An assignment such as $G = (); does work syntactically, but it is not the right way to reset a scalar: assigning an empty list leaves $G undefined rather than zero. The correct way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;

my %counts;
while (<>) {
    if (/^>/) {
        # print counts for the prev gene if there are counts:
        print_counts(\%counts) if keys %counts;
        %counts = (); # reset the counts
        print $_;     # print the Fasta header
    } else {
        chomp;
        $counts{$_}++ for split //;
    }
}
print_counts(\%counts) if keys %counts; # print counts for last gene

sub print_counts {
    my ($counts) = @_;
    print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put a closing brace on the same line as another statement, like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.
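A further shortcut: tr/// in scalar context returns the number of matching characters, so the per-line counts could also be taken without splitting at all, along these lines (using the $line variable from the loop above):
my $G = ($line =~ tr/G//);
my $C = ($line =~ tr/C//);
my $A = ($line =~ tr/A//);
my $T = ($line =~ tr/T//);
my $total = $A + $C + $G + $T;   # counts only A/C/G/T; other characters are ignored
printf "GC content is %.2f%%\n", ($G + $C) / $total * 100 if $total;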

Perl Array Values Access and Sum by each unique key

# my code as follows
use strict;
use FileHandle;
my @LISTS = ('incoming');
my $WORK ="c:\";
my $OUT ="c:\";
foreach my $list (@LISTS) {
my $INFILE = $WORK."test.dat";
my $OUTFILE = $OUT."TEST.dat";
while (<$input>) {
chomp;
my($f1,$f2,$f3,$f4,$f5,$f6,$f7) = split(/\|/);
push @sum, $f4,$f7;
}
}
while (@sum) {
my ($key,$value)= {shift @sum, shift @sum};
$hash{$key}=0;
$hash{$key} += $value;
}
while my $key (@sum) {
print $output2 sprintf("$key1\n");
# print $output2 sprintf("$key ===> $hash{$key}\n");
}
close($input);
close($output);
I am getting "uninitialized value in addition (+)" errors if I use the 2nd print.
I get values like HASH(0x19a69451) if I use the 1st print.
Please help me correct this.
My output should be
unique Id ===> Total Revenue ($f4==>$f7)
This is wrong:
"c:\";
Perl reads that as a string that starts with c:";\n... and keeps on going; in other words, it is a runaway string. You need to write the last character as \\ to escape the backslash and stop it from escaping the closing " character.
You probably want to use parens instead of braces:
my ($key, $value) = (shift @sum, shift @sum);
You would get that "uninitialized value in addition (+)" warning if the @sum array has an odd number of elements.
See also perltidy.
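A small defensive sketch of that loop, assuming %hash is declared beforehand, could look like:
warn "odd number of elements in \@sum\n" if @sum % 2;
while (@sum >= 2) {
    my ($key, $value) = (shift @sum, shift @sum);
    $hash{$key} += $value;
}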
You will never enter the second while loop:
while my $key (@sum) {
because the previous one left the array @sum empty.
You could change to:
use Data::Dumper;

while (<$input>) {
    chomp;
    my @tmp = split(/\|/);
    $hash{$tmp[3]} += $tmp[6];
}
print Dumper \%hash;
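To get the "unique Id ===> Total Revenue" output you described, you could then print the hash sorted by key; a minimal sketch, assuming $output2 is an open filehandle as in your code:
for my $id (sort keys %hash) {
    print $output2 "$id ===> $hash{$id}\n";
}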

Perl merging 2 csv files line by line with a primary key

Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line;
"one,,three,,42"
where, using 0-indexed columns, $position = 4 holds the primary key, 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I use a hash to store each primary key, with the line number as the value. The line number is an index into @file1array, which stores every line of the first file.
Then I go through every line in the second file and check whether its primary key is in the hash; if it is, I fetch the line from @file1array, copy the columns I need from the first array into the second array, and concatenate the result onto the end of the output string. Then I delete that hash entry, and at the very end I dump the entire thing to a file. (I am using an SSD, so I want to minimise file writes.)
It is probably best explained with the code:
sub merge_file2{
    my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;

    my %line_for;
    my @file1array;
    open FILE1, "<$file1";
    print "$file1 opened\n";
    while (<FILE1>){
        chomp;
        $line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
        $file1array[$.] = $_; #store line in file1array.
    }
    close FILE1;

    print "$file2 opened - merging..\n";
    open FILE2, "<", $file2;
    my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
    while (<FILE2>){
        print "$.\n" if ($.%1000) == 0;
        chomp;
        my @array1 = ();
        my @array2 = ();
        my @array2 = split /,/, $_; #split 2nd csv line by commas
        my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
        # prev line: look up the line in the 1st file via the hash, using the key position
        #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)
        foreach(@from1to2){
            $array2[$_] = $array1[$_];
        }
        my $outstring = join ",", @array2;
        $OUTSTRING.=$outstring."\n";
        delete $line_for{$array2[$position]};
    }
    close FILE2;

    print "adding rest of lines\n";
    foreach my $key (sort { $a <=> $b } keys %line_for){
        $OUTSTRING.= $file1array[$line_for{$key}]."\n";
    }

    print "writing file $out\n\n\n";
    write_line($out,$OUTSTRING);
}
The first while loop is fine and takes less than a minute; however, the second while loop takes about an hour to run, and I am wondering whether I have taken the right approach. I think a lot of speedup should be possible? :) Thanks in advance.
Solution:
sub merge_file3{
    my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
    print "merging: \n$file1 and \n$file2, to: \n$out\n";
    my $OUTSTRING = undef;
    my $header;

    my (@file1,@file2);
    open FILE1, "<$file1" or die;
    while (<FILE1>){
        if ($.==1){
            $header = $_;
            next;
        }
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file1, [split ',', $_];
    }
    close FILE1;

    open FILE2, "<$file2" or die;
    while (<FILE2>){
        next if $.==1;
        print "$.\n" if ($.%100000) == 0;
        chomp;
        push @file2, [split ',', $_];
    }
    close FILE2;

    print "sorting files\n";
    my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
    my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;
    print "sorted\n";
    @file1 = undef;
    @file2 = undef;
    #foreach my $line (@file1){print "\t [ @$line ],\n"; }

    my ($i,$j) = (0,0);
    while ($i < $#sortedf1 and $j < $#sortedf2){
        my $key1 = $sortedf1[$i][$position];
        my $key2 = $sortedf2[$j][$position];
        if ($key1 eq $key2){
            foreach(0..$hsize){ #header size.
                $sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
            }
            $i++;
            $j++;
        }
        elsif ( $key1 < $key2){
            push(@sortedf2,[@{$sortedf1[$i]}]);
            $i++;
        }
        elsif ( $key1 > $key2){
            $j++;
        }
    }
    #foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }

    print "outputting to file\n";
    open OUT, ">$out";
    print OUT $header;
    foreach(@sortedf2){
        print OUT (join ",", @{$_})."\n";
    }
    close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join (a rough sketch follows after the next point).
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
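Here is a rough sketch of the database technique with DBD::SQLite; the table and column names (t1, t2, c0..c3, pk) are invented, since the real CSV layout isn't shown, and keys that appear only in the second file would need an extra query (or a UNION) on top of the LEFT JOIN shown:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=merge.db", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE t1 (c0 TEXT, c1 TEXT, c2 TEXT, c3 TEXT, pk TEXT PRIMARY KEY)");
$dbh->do("CREATE TABLE t2 (c0 TEXT, c1 TEXT, c2 TEXT, c3 TEXT, pk TEXT PRIMARY KEY)");

# Load each CSV; Text::CSV would be safer than split if fields can contain commas.
for my $load (["file1.csv", "t1"], ["file2.csv", "t2"]) {
    my ($file, $table) = @$load;
    my $ins = $dbh->prepare("INSERT INTO $table (c0, c1, c2, c3, pk) VALUES (?, ?, ?, ?, ?)");
    open my $fh, "<", $file or die "open $file: $!";
    while (<$fh>) {
        chomp;
        $ins->execute( ( split /,/, $_, -1 )[ 0 .. 4 ] );
    }
    close $fh;
}

# Let the database do the join; NULLIF/COALESCE prefer the non-empty value from file 1.
my $sth = $dbh->prepare(
    "SELECT COALESCE(NULLIF(t1.c0,''), t2.c0), COALESCE(NULLIF(t1.c1,''), t2.c1),
            COALESCE(NULLIF(t1.c2,''), t2.c2), COALESCE(NULLIF(t1.c3,''), t2.c3), t1.pk
     FROM t1 LEFT JOIN t2 ON t1.pk = t2.pk"
);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    print join(",", map { defined $_ ? $_ : "" } @row), "\n";
}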
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){ # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){ # Fast
    print $fh join('', @data);
}
else { # Fast
    print $fh @data;
}
If you want a merge, you should really merge. First of all, you have to sort your data by key and then merge! You will beat even MySQL in performance; I have a lot of experience with this.
You can write something along these lines:
#!/usr/bin/env perl

use strict;
use warnings;

use Text::CSV_XS;
use autodie;

use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;

my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);

while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);
}

close $file1;
close $file2;
Redirect output to file and measure.
If you would like more safety around the sort arguments, you can replace the file-opening part with:
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the @file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
    chomp;
    $line_for{read_csv_string($_,$position)} = $_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', @OUTLINES) at the end.
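A bare-bones illustration of that push-then-join pattern, with dummy row strings standing in for the lines built inside the FILE2 loop:
my @outlines;
for my $i (1 .. 5) {
    push @outlines, "row $i";                     # instead of $OUTSTRING .= "row $i\n";
}
my $OUTSTRING = join("\n", @outlines) . "\n";     # one join at the very end
print $OUTSTRING;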
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming lines of around 20 bytes, each of your files would amount to about 20 MB, which isn't too big.
Since you are using a hash, your time complexity doesn't seem to be a problem.
In your second loop you are printing to the console for each line; this bit is slow. Removing that should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help, but not by much, I think; there is always going to be read-ahead happening behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0; # for // ("defined-or")
use Carp;
use Text::CSV;
sub merge_csv {
    my ($path, $record) = @_;

    open my $fh, "<", $path or croak "$0: open $path: $!";

    my $csv = Text::CSV->new;
    local $_;
    while (<$fh>) {
        if ($csv->parse($_)) {
            my @f = map length($_) ? $_ : undef, $csv->fields;
            next unless @f >= 1;
            my $primary = pop @f;

            if ($record->{$primary}) {
                $record->{$primary}[$_] //= $f[$_]
                    for 0 .. $#{ $record->{$primary} };
            }
            else {
                $record->{$primary} = \@f;
            }
        }
        else {
            warn "$0: $path:$.: parse failed; skipping...\n";
            next;
        }
    }
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
    '42' => [
        'one',
        'two',
        'three',
        'four'
    ]
};
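One way to write the merged records back out as CSV with the same module, assuming %rec is laid out as shown above and the primary key should be re-appended as the last column:
use Text::CSV;

my $csv = Text::CSV->new( { eol => "\n" } );
open my $out, ">", "merged.csv" or die "$0: open merged.csv: $!";
for my $key (sort keys %rec) {
    my @fields = map { defined $_ ? $_ : "" } @{ $rec{$key} };
    $csv->print($out, [ @fields, $key ]);
}
close $out;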