With the below CSV data:
name,place,animal
a,,
b,,
a,,
,b,
The name field is present in 3 rows but missing in 1 row.
The place field is present in 1 row but missing in 3 rows.
The animal field is empty in all the rows -> get these column names.
I would like to get the names of only those columns that are empty in all the rows.
I am trying to write a Perl script for this but am not sure how to attack the problem.
Step 1: Check all the columns in the first row; if a column is not empty, don't check it in the next rows.
Step 2: Keep repeating step 1 in a loop, and at the end we have the output. This brings down the complexity, since we don't care about columns that have a value even once.
I will implement the code and post it here.
But if you have any new ideas, please advise me.
Thanks
For CSV files with no quotes and escaping, just keep a hash of empty columns so far. Reading the file line by line, remove any non-empty column from the hash:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
chomp( my @column_names = split /,/, <> );
my %empty;
@empty{ @column_names } = ();

while (<>) {
    chomp;
    my @columns = split /,/;
    for my $i (0 .. $#columns) {
        delete $empty{ $column_names[$i] } if length $columns[$i];
    }
}
say for keys %empty;
For real CSV files, use Text::CSV_XS, but the method is the same: populate a hash by column names, then remove the non-empty ones:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Text::CSV_XS qw{ csv };
my %empty;
csv(in      => shift,
    out     => \ 'skip',
    headers => sub { undef $empty{ $_[0] }; $_[0] },
    on_in   => sub {
        my (undef, $columns) = @_;
        delete @empty{ grep length $columns->{$_}, keys %$columns };
    },
);

say for keys %empty;
As rows are processed, update an ancillary array which keeps track of each field's truth value.
If any field in a new row is non-empty, the corresponding element of the array flips to true; otherwise it stays false. In the end, the indices of the array's false elements identify the empty columns.
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = 'cols.csv';
my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, '<', $file or die "Can't open $file: $!";

my @col_names = @{ $csv->getline($fh) };

my @mask;
while (my $line = $csv->getline($fh)) {
    @mask = map { $mask[$_] || $line->[$_] ne '' } (0 .. $#$line);
}

for (0 .. $#mask) {
    say "Column \"$col_names[$_]\" is empty" if not $mask[$_];
}
Syntax: $#$line is the index of the last element of arrayref $line (like $#ary is for @ary)
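To illustrate that syntax, here is a minimal standalone example (not part of the program above):
use strict;
use warnings;

my @ary  = (10, 20, 30);
my $line = \@ary;             # $line is a reference to @ary
print $#ary,   "\n";          # prints 2: the last index of @ary
print $#$line, "\n";          # prints 2: the last index of the array $line references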
I have an array (@array) containing a list of elements. I need to check whether each of these elements exists in a master file or not. If an element exists in the master file, then the string YES should also exist in the 5th position of the same line of the master file, and the element should be stored in a different array.
Currently my script uses two shell grep commands to achieve this. How can I do the same grep in pure Perl?
...
use Data::Dumper;

my @new_array;
my @array = ('RT0AC1', 'WG3RA3');
print Dumper(\@array);

foreach ( @array ){
    my $line = `grep $_ "master_file.csv" | grep -i yes`;
    next unless($line);
    push( @new_array, $_ );
}
print Dumper(@new_array);
...
where master_file.csv looks like this:
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
So here I am getting the $line value as 102,RT0AC1,CONNECTED,WORKING,YES and the element RT0AC1 is getting stored in @new_array.
How can I avoid using backticks (`) and the two greps to achieve this? I am trying to do this in pure Perl. Also, the master_file.csv contains millions of records.
Since all the words you're looking for are in the same location, it's easy to just split up the current line on commas and see if the second column exists in a hash table, and if the fifth column is equal to "YES":
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
my $filename = shift // "master_file.csv";    # Default filename if not given on command line
my @array = qw/RT0AC1 WG3RA3/;                # Words you're looking for
my %words = map { $_ => 1 } @array;           # Store them in a hash for fast lookup

my @new_array;

# Use Text::CSV_XS for non-trivial CSV files
open my $csv, "<", $filename;

while (<$csv>) {
    chomp;
    my @F = split /,/;
    push @new_array, $F[1] if exists $words{$F[1]} && $F[4] eq "YES";
}

print Dumper(\@new_array);
Form a regex to match the records of interest, split each line into fields, and compare field #5 to YES. If there is a match, increment a count for field #2 in the %match hash.
Once the file is processed, the %match hash will have each matched record's field #2 as a key, and the value will reflect how many times that field was matched with YES in the file.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for  = qw(RT0AC1 WG3RA3);
my $re_filter = join('|', @look_for);

while(<DATA>) {
    chomp;
    next unless /$re_filter/;
    my @data = split(',', $_);
    $match{$data[1]}++ if $data[4] eq 'YES';
}

say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
Output
$VAR1 = {
          'RT0AC1' => 1
        };
Remove the DATA section to get the final code, and give the filename on the command line to process a file with the data of interest:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for  = qw(RT0AC1 WG3RA3);
my $re_filter = join('|', @look_for);

while(<>) {
    chomp;
    next unless /$re_filter/;
    my @data = split(',', $_);
    $match{$data[1]}++ if $data[4] eq 'YES';
}

say Dumper(\%match);
An alternative version, based on a regular expression without using split:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my %match;
my @look_for  = qw(RT0AC1 WG3RA3);
my $re_filter = join('|', @look_for);
my $regex     = qr/^\d+,($re_filter),[^,]+,[^,]+,YES$/;

/$regex/ && $match{$1}++ for <DATA>;

say Dumper(\%match);
__DATA__
101,RT0AC1,CONNECTED,FAULTY,NO
102,RT0AC1,CONNECTED,WORKING,YES
103,RT0AC1,NOT CONNECTED,WORKING,NO
104,WG3RA3,NOT CONNECTED,DISABLED,NO
105,WG3RA3,CONNECTED,WORKING,NO
I have lots of data dumps, in a pretty huge amount, structured as follows:
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Which I would like to transform to something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
I mean:
Generate a collection of all the keys
Generate a header line with all the Keys
Map all the values to their correct "columns" (notice that in this example I have no "Key4", and Key3/Key5 interchanged)
Possibly in Perl, since it would be easier to use in various environments.
But I am not sure if this format is unusual, or if there is a tool that already does this.
This is fairly easy using hashes and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;
my @rows;
my %headers;

{
    local $/ = "";
    while (<DATA>) {
        chomp;
        my %record;
        for my $line (split(/\n/)) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1} = $2;
            $headers{$1} = $1;
        }
        push(@rows, \%record);
    }
}

unshift(@rows, \%headers);

my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));

for my $row_ref (@rows) {
    $csv->print_hr(*STDOUT, $row_ref);
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
If your CSV format is 'complicated' - e.g. it contains commas, etc. - then use one of the Text::CSV modules. But if it isn't - and this is often the case - I tend to just work with split and join.
What's useful in your scenario is that you can map key-value pairs within a record quite easily using a regex, and then use a hash slice to output:
#!/usr/bin/env perl
use strict;
use warnings;
#set paragraph mode - records are blank line separated.
local $/ = "";
my @rows;
my %seen_header;

#read STDIN or files on command line, just like sed/grep
while ( <> ) {
    #multi-line pattern that matches all the key-value pairs,
    #and then inserts them into a hash.
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    push ( @rows, \%this_row );

    #add the keys we've seen to a hash, so we 'know' what we've seen.
    $seen_header{$_}++ for keys %this_row;
}

#extract the keys, make them unique and ordered.
#could set this by hand if you prefer.
my @header = sort keys %seen_header;

#print the header row
print join ",", @header, "\n";

#iterate the rows
foreach my $row ( @rows ) {
    #use a hash slice to select the values matching @header.
    #the map is so any undefined values (missing keys) don't report errors, they
    #just return blank fields.
    print join ",", map { $_ // '' } @{$row}{@header}, "\n";
}
This, for your sample input, produces:
Key1,Key2,Key3,Key5,
Value,Other value,Maybe another value yet,,
Different value,,Invaluable,Has no value at all,
If you want to be really clever, then most of that initial building of the loop can be done with:
my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;
The problem then is that you would still need to build up the 'headers' array, and that means something a bit more complicated:
$seen_header{$_}++ for map { keys %$_ } @rows;
It works, but I don't think it's as clear about what's happening.
However, the core of your problem may be the file size: you need to read the file twice, first to figure out which headings exist throughout the file, and then a second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;
open ( my $input, '<', 'your_file.txt') or die $!;

local $/ = "";
my %seen_header;
while ( <$input> ) {
    $seen_header{$_}++ for m/^(\w+):/gm;
}
my @header = sort keys %seen_header;

#return to the start of file:
seek ( $input, 0, 0 );

while ( <$input> ) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join ",", map { $_ // '' } @this_row{@header}, "\n";
}
This will be slightly slower, as it has to read the file twice, but it won't use nearly as much memory, because it isn't holding the whole file in memory.
Unless you know all your keys in advance, and you can just define them, you'll have to read the file twice.
This seems to work with the data you've given
use strict;
use warnings 'all';
my %data;

while ( <> ) {
    next unless /^(\w+):\W*(.*\S)/;
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;
output
{
  Key1 => ["Value", "Different value"],
  Key2 => ["Other value"],
  Key3 => ["Maybe another value yet", "Invaluable"],
  Key5 => ["Has no value at all"],
}
I have a big tab-separated file with duplicate products but with different colours and amounts. I’m trying to merge the data based on the key so that I end up with one product and the combined colours and amounts separated by a delimiter (comma in this case).
I'm using the Text::CSV module so that I have better control, and because it allows me to output the file with a different delimiter (changing semicolon to pipe).
My question is, how do I merge the data properly? I want it not simply to combine the colours and amounts but to remove duplicate values as well. So I was thinking of a key/value setup with Id/Amount and Id/Colour. But Id isn't unique, so how do I do this? Do I create an array or use hashes?
Here is some sample source data, with the tab separators replaced by semicolons ;. Note that the marked row has no Colour so the empty value is not combined in the result.
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla;
101;Fruits;50020;Strawberry;500;Red;1;
101;Fruits;50020;Strawberry;1000;Red;1;
201;Vegetables;60090;Tomato;50;Green;1;
201;Vegetables;60080;Onion;1;Purple;1;
201;Vegetables;60090;Tomato;100;Red;1;
201;Vegetables;60010;Carrot;100;Purple;1;
201;Vegetables;60050;Broccoli;500;Green;1;
201;Vegetables;60050;Broccoli;1000;Green;1;
201;Vegetables;60090;Tomato;500;Yellow;1;
101;Fruits;50060;Apple;500;Green;1;
101;Fruits;50010;Grape;500;Red;1;
201;Vegetables;60010;Carrot;500;White;1;
201;Vegetables;60050;Broccoli;2000;Green;1;
201;Vegetables;60090;Tomato;1000;Red;1;
101;Fruits;50020;Strawberry;100;Red;1;
101;Fruits;50060;Apple;1000;Red;1;
201;Vegetables;60010;Carrot;250;Yellow;1;
101;Fruits;50010;Grape;100;White;1;
101;Fruits;50030;Banana;500;Yellow;1;
201;Vegetables;60010;Carrot;1000;Yellow;1;
101;Fruits;50030;Banana;1000;Green;1;
101;Fruits;50020;Strawberry;200;Red;1;
101;Fruits;50010;Grape;200;White;1;
201;Vegetables;60010;Carrot;50;Orange;1;
201;Vegetables;60080;Onion;2;White;1;
And the desired result I'm trying to get:
101;Fruits;50010;Grape;100,500,200;Red,White;1;
201;Vegetables;60090;Tomato;50,500,1000,100;Yellow,Green,Red;1;
101;Fruits;50060;Apple;500,1000;Red,Green;1;
201;Vegetables;60010;Carrot;250,50,500,1000,100;Orange,Yellow,White,Purple;1;
201;Vegetables;60050;Broccoli;1000,500,2000;Green;1;
101;Fruits;50020;Strawberry;100,1000,200,500;Red;1;
101;Fruits;50030;Banana;500,1000;Yellow,Green;1;
201;Vegetables;60080;Onion;2,1;White,Purple;1;
This is my script so far. It's not finished (and not working) because I'm not sure how to continue. I don't think this can work right because I'm trying to use the same key for different colours.
use strict;
use warnings;
use Text::CSV;
use List::MoreUtils 'uniq';
my $inputfile = shift || die "Give input and output names!\n";
my $outputfile = shift || die "Give output name!\n";
open my $infile, '<', $inputfile or die "Sourcefile in use / not found :$!\n";
open my $outfile, '>', $outputfile or die "Outputfile in use :$!\n";
binmode($outfile, ":encoding(utf8)");
my $csv_in = Text::CSV->new({binary => 1,sep_char => ";",eol => $/});
my $csv_out = Text::CSV->new({binary => 1,sep_char => "|",always_quote => 1,eol => $/}); #,quote_null => 0 #
my %data;
while (my $elements = $csv_in->getline($infile)){
    my $id = $elements->[2];
    push @{ $data{$id} }, $elements;
}

for my $id ( sort keys %data ){
    my $set = $data{$id};
    my @elements = @{ $set->[0] };
    $elements[4] = join ',', uniq map { $_->[4] } @$set;
    $elements[5] = join ',', uniq map { $_->[5] } @$set;
    $csv_in->combine(@elements);
    $csv_out->print($outfile, \@elements);
}
Edit: I'm using Data::Dumper for testing, but eventually I want the output written to a file.
Hashes deal with unique keys. As you've correctly surmised - if you 'overwrite' colour, then ... the old value is replaced.
But hashes can contain array(ref)s. So you can do:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $id = 50010;
my %hash;

$hash{$id}{'colour'} = [ "red", "green", "blue" ];
push( @{ $hash{$id}{'colour'} }, "orange" );

print Dumper \%hash;
This'll work, provided you don't have any duplicates for the colours (e.g. there's only one line for White Grapes with that ID).
You may have to post-process with join, to turn the array into a string.
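For example, a minimal sketch of that post-processing step (assuming the %hash structure built above):
for my $id ( keys %hash ) {
    # flatten the arrayref of colours into a single comma-separated string
    $hash{$id}{'colour'} = join ',', @{ $hash{$id}{'colour'} };
}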
Or as an alternative, you could concatenate the colours onto the existing:
if ( defined $hash->{$id}->{colour} ) {
    $hash->{$id}->{colour} .= ",$colour";
}
I would also note that I'm unclear what you're doing with $elements->[10], because there aren't 10 columns. And I would strongly suggest not using generic names for variables, like %hash, because it's a bad habit to get into. Vague variable names are bad style, and while that's largely academic when you're looking at a small chunk of code, it pays to get into the habit of making it clear what you can expect to be in a particular variable (especially when the data type isn't obvious).
I don't have time to write a proper commentary, but this program seems to do what you need. It uses the uniq function from the List::MoreUtils module, which isn't a core module and so may need installing. I trust that it's not important what order the Amounts and Colours appear in the combined fields?
use strict;
use warnings;
use List::MoreUtils 'uniq';
print scalar <DATA>;

my %data;

while (<DATA>) {
    chomp;
    my @fields = split /;/;
    my $id = $fields[2];
    push @{ $data{$id} }, \@fields;
}

for my $id ( sort keys %data ) {
    my $set = $data{$id};
    my @fields = @{ $set->[0] };
    $fields[4] = join ',', uniq map { $_->[4] } @$set;
    $fields[5] = join ',', uniq map { $_->[5] } @$set;
    print join(';', @fields, ''), "\n";
}
__DATA__
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla;
101;Fruits;50020;Strawberry;500;Red;1;
101;Fruits;50020;Strawberry;1000;Red;1;
201;Vegetables;60090;Tomato;50;Green;1;
201;Vegetables;60080;Onion;1;Purple;1;
201;Vegetables;60090;Tomato;100;Red;1;
201;Vegetables;60010;Carrot;100;Purple;1;
201;Vegetables;60050;Broccoli;500;Green;1;
201;Vegetables;60050;Broccoli;1000;Green;1;
201;Vegetables;60090;Tomato;500;Yellow;1;
101;Fruits;50060;Apple;500;Green;1;
101;Fruits;50010;Grape;500;Red;1;
201;Vegetables;60010;Carrot;500;White;1;
201;Vegetables;60050;Broccoli;2000;Green;1;
201;Vegetables;60090;Tomato;1000;Red;1;
101;Fruits;50020;Strawberry;100;Red;1;
101;Fruits;50060;Apple;1000;Red;1;
201;Vegetables;60010;Carrot;250;Yellow;1;
101;Fruits;50010;Grape;100;White;1;
101;Fruits;50030;Banana;500;Yellow;1;
201;Vegetables;60010;Carrot;1000;Yellow;1;
101;Fruits;50030;Banana;1000;Green;1;
101;Fruits;50020;Strawberry;200;Red;1;
101;Fruits;50010;Grape;200;White;1;
201;Vegetables;60010;Carrot;50;Orange;1;
201;Vegetables;60080;Onion;2;White;1;
output
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla;
101;Fruits;50010;Grape;500,100,200;Red,White;1;
101;Fruits;50020;Strawberry;500,1000,100,200;Red;1;
101;Fruits;50030;Banana;500,1000;Yellow,Green;1;
101;Fruits;50060;Apple;500,1000;Green,Red;1;
201;Vegetables;60010;Carrot;100,500,250,1000,50;Purple,White,Yellow,Orange;1;
201;Vegetables;60050;Broccoli;500,1000,2000;Green;1;
201;Vegetables;60080;Onion;1,2;Purple,White;1;
201;Vegetables;60090;Tomato;50,100,500,1000;Green,Red,Yellow;1;
I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example, take out column A and compare it to another column in another file:
A B C
A B C
A B C
So far I have:
sub routine1
{
    ( $_ = <FILE> )
    {
        next if $. < 2;    # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
I have most of it done. The only problem is that when I call to print the subroutine it only prints the first element in the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
    while ( $_ = <FILE> )
    {
        next if $. < 2;    # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
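To see that pitfall in isolation, here is a minimal demonstration (hypothetical values, not from the question):
use strict;
use warnings;

my @array1 = ('A', 'B', 'C');

sub demo {
    # @_ used as an array index is evaluated in scalar context,
    # giving its element count rather than its contents
    return $array1[@_];
}

print demo('x'), "\n";    # @_ holds one argument, so this prints $array1[1]: 'B'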
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;

open my $fh, '<', 'myfile.txt' or die $!;

my @column2 = routine1($fh, 1);
print "@column2\n";

sub routine1 {
    my ($fh, $index) = @_;
    my @retval;
    while ($_ = <$fh>) {
        next if $. < 2;    # to skip header of file
        my @fields = split /\t/;
        my $field = $fields[$index];
        push @retval, $field;
    }
    return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn = ();
while (<FILE>)
{
    chomp;
    my ($Acol, $Bcol, $Ccol) = split("\t");
    push(@aColumn, $Acol);
}
return @aColumn;
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
open my $fh, '<', $file;

my @firstcol;
while (<$fh>) {
    chomp;
    my @cols = split;
    push @firstcol, $cols[0];
}

use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT; this can significantly speed up the process (see the short demonstration after this list).
Optionally, initialize a local copy of Perl's @ARGV with the file name, and let Perl handle the file I/O.
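As a quick aside, here's a minimal demonstration of split's LIMIT argument (hypothetical data, not from the question):
my @parts = split /,/, 'a,b,c,d,e', 3;    # stop after 3 fields
# @parts is ('a', 'b', 'c,d,e') -- the remainder is left unsplit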
Borodin's solution of creating a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;
my @colVals = getFileCol( 'File.txt', 0 );
print "@colVals\n";

sub getFileCol {
    local @ARGV = (shift);
    my ( $col, @arr ) = shift;
    <>;    # skip file header
    while (<>) {
        my $val = ( split ' ', $_, $col + 2 )[$col] or next;
        push @arr, $val;
    }
    return @arr;
}
Output on your dataset:
A A
Hope this helps!
I have two files.
One is composed of a unique list of names, while the other is a redundant list of names with ages.
for example
File1: File2:
Gaia Gaia 3
Matt Matt 12
Jane Gaia 89
Reuben 4
My aim is to match File1 and File2 and to retrieve the highest age for each name.
So far I have written the below code.
The bit that does not work quite well is: when the same key is found in the hash, keep the pair with the bigger value.
Any suggestion/comment is welcome!
Thanks!!
#!/usr/bin/perl -w
use strict;
open (FILE1, $ARGV[0]) || die "unable to open arg1\n";    # Opens first file for comparison
open (FILE2, $ARGV[1]) || die "unable to open arg2\n";    # 2nd for comparison

my @not_red     = <FILE1>;
my @exonslength = <FILE2>;

# 2) Produce a hash of File2. If the key is already in the hash, keep the
#    key-value pair with the highest value. Otherwise, next.
my %hash_doc2;
my @split_exons;
my $key;
my $value;

foreach my $line (@exonslength) {
    @split_exons = split "\t", $line;
    @hash_doc2{$split_exons[0]} = ($split_exons[1]);
    if (exists $hash_doc2{$split_exons[0]}) {
        if ( $hash_doc2{$split_exons[0]} > values %hash_doc2) {
            $hash_doc2{$split_exons[0]} = ($split_exons[1]);
        } else { next; }
    }
}

# 3) grep the non-redundant list of genes from the hash with the corresponding values
my @a = grep (@not_red, %hash_doc2);
print "@a\n";
Do you need to keep all the values? If not, you can keep only the max value:
@split_exons = split "\t", $line;
if (exists $hash_doc2{$split_exons[0]}
    and $hash_doc2{$split_exons[0]} < $split_exons[1]) {
    $hash_doc2{$split_exons[0]} = $split_exons[1];
}
Your code does not keep all the values, either. You cannot store an array in a hash value; you have to store a reference. Adding a new value to an array can be done with push:
push @{ $hash_doc2{$split_exons[0]} }, $split_exons[1];
Your use of numeric comparison against values is also not doing what you think. The < operator imposes a scalar context, so values returns the number of values. Another option would be to store the values sorted and always ask for the highest value:
$hash_doc2{$split_exons[0]} = [ sort { $a <=> $b } @{ $hash_doc2{$split_exons[0]} }, $split_exons[1] ];
# max for $x is at $hash_doc2{$x}[-1]
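As a side note (not part of the original answer), List::Util's max offers a simpler way to keep just the running maximum:
use List::Util 'max';

# keep only the highest value seen so far for each key
$hash_doc2{$split_exons[0]} = max grep defined,
    $hash_doc2{$split_exons[0]}, $split_exons[1];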
Instead of reading in the whole of file2 into an array (which will be bad if it's big), you could loop through and process the data file line by line:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dumper;
open( my $nameFh, '<', $ARGV[0]);
open( my $dataFh, '<', $ARGV[1]);

my $dataHash      = {};
my $processedHash = {};

while(<$dataFh>){
    chomp;
    my ( $name, $age ) = split /\s+/, $_;
    if(! defined($dataHash->{$name}) or $dataHash->{$name} < $age ){
        $dataHash->{$name} = $age;
    }
}

while(<$nameFh>){
    chomp;
    $processedHash->{$_} = $dataHash->{$_} if defined $dataHash->{$_};
}

print Dumper($processedHash);