How to concatenate values for duplicate hash keys in Perl? - perl

I know it is not possible to have duplicate keys in a hash, but this is what my data looks like:
Key Value
SETUP_FACE_PROT great
SETUP_FACE_PROT great2
SETUP_FACE_PROT great3
SETUP_FACE_PROT great3
SETUP_ARM_PROT arm
SETUP_FOOT_PROT foot
SETUP_FOOT_PROT foot2
SETUP_HEAD_PROT goggle
I would like to concatenate values for repeated keys, separated by a * character. For example, this is what I want the output to look like:
SETUP_FACE_PROT'=great*great2*great3',
SETUP_ARM_PROT='arm',
SETUP_FOOT_PROT='foot*foot2',
SETUP_HEAD_PROT='google'
This is how I've tried to solve the problem so far:
foreach my $key ( sort keys %stuff )
{
print "$key=\'", join( "*", #{ $stuff{$key} } ), "\'\n";
}
But instead of printing the result, how can I store it in a variable so that I can pass it to another subroutine? I'm trying to create a new string that looks like this:
$newstring="
SETUP_FACE_PROT='great*great2*great3',
SETUP_ARM_PROT='arm',
SETUP_FOOT_PROT='foot*foot2',
SETUP_HEAD_PROT='google' "

You can't duplicate keys, you can create a hash of arrays.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %stuff;
while (<DATA>) {
my ( $key, $value ) = split;
push( #{ $stuff{$key} }, $value );
}
print Dumper \%stuff;
foreach my $key ( sort keys %stuff ) {
print "$key=\'", join( "*", #{ $stuff{$key} } ), "\'\n";
}
__DATA__
SETUP_FACE_PROT great
SETUP_FACE_PROT great2
SETUP_FACE_PROT great3
SETUP_FACE_PROT great3
SETUP_ARM_PROT arm
SETUP_FOOT_PROT foot
SETUP_FOOT_PROT foot2
SETUP_HEAD_PROT goggle
Edit:
Turning it into a string as requested:
my $results;
foreach my $key ( sort keys %stuff ) {
$results .= "$key=\'". join( "*", #{ $stuff{$key} } ). "\'\n";
}
print $results;
Or perhaps using print still with a filehandle:
my $results;
open ( my $output, '>', \$results );
foreach my $key ( sort keys %stuff ) {
print {$output} "$key=\'", join( "*", #{ $stuff{$key} } ), "\'\n";
}
close ( $output );
print $results;

At last i got an answer doing this.
use Data::Dumper;
my %stuff;
use Text::CSV;
my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
my $filenamex = 'duplicate2.csv';
$checkstring ='';
open(my $datab, '<', $filenamex) or die "Could not open '$filename' $!\n";
$i=1;
my %datan;
while (my $linea = <$datab>)
{
chomp $linea;
#fieldsx = split ",",$linea;
$key = $fieldsx[0];
$value = $fieldsx[1];
# print $key;
push( #{ $stuff{$key} }, $value );
}
foreach my $key ( sort keys %stuff )
{
$checkstring = $checkstring.','.$key.'='. join( "*", #{ $stuff{$key} } );
}
print $checkstring;

Related

Perl : Need to append two columns if the ID's are repeating

If id gets repeated I am appending app1, app2 and printing it once.
Input:
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
Output:
id|Name|app1|app2
1|abc|234,265|231,321|
2|xyz|123|215|
3|asd|213|235|
Output I'm getting:
id|Name|app1|app2
1|abc|234,231|
2|xyz|123,215|
1|abc|265,321|
3|asd|213,235|
My Code:
#! usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
my $counter = 0;
my %RepeatNumber;
my $pos=0;
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
open(FH, '<', join('', $basedir, $file)) || die $!;
my $line = readline(FH);
unless ($counter) {
chomp $line;
print OUTFILE $line;
print OUTFILE "\n";
}
while ($line = readline(FH)) {
chomp $line;
my #obj = split('\|',$line);
if($RepeatNumber{$obj[0]}++) {
my $str1= join("|",$obj[0]);
my $str2=join(",",$obj[2],$obj[3]);
print OUTFILE join("|",$str1,$str2);
print OUTFILE "\n";
}
}
This should do the trick:
use strict;
use warnings;
my $file_in = "doctor.txt";
open (FF, "<$file_in");
my $temp = <FF>; # remove first line
my %out;
while (<FF>)
{
my ($id, $Name, $app1, $app2) = split /\|/, $_;
$out{$id}[0] = $Name;
push #{$out{$id}[1]}, $app1;
push #{$out{$id}[2]}, $app2;
}
foreach my $key (keys %out)
{
print $key, "|", $out{$key}[0], "|", join (",", #{$out{$key}[1]}), "|", join (",", #{$out{$key}[2]}), "\n";
}
EDIT
To see what the %out contains (in case it's not clear), you can use
use Data::Dumper;
and print it via
print Dumper(%out);
I'd tackle it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use 5.14.0;
my %stuff;
#extract the header row.
#use the regex to remove the linefeed, because
#we can't chomp it inline like this.
#works since perl 5.14
#otherwise we could just chomp (#header) later.
my ( $id, #header ) = split( /\|/, <DATA> =~ s/\n//r );
while (<DATA>) {
#turn this row into a hash of key-values.
my %row;
( $id, #row{#header} ) = split(/\|/);
#print for diag
print Dumper \%row;
#iterate each key, and insert into $row.
foreach my $key ( keys %row ) {
push( #{ $stuff{$id}{$key} }, $row{$key} );
}
}
#print for diag
print Dumper \%stuff;
print join ("|", "id", #header ),"\n";
#iterate ids in the hash
foreach my $id ( sort keys %stuff ) {
#join this record by '|'.
print join('|',
$id,
#turn inner arrays into comma separated via map.
map {
my %seen;
#use grep to remove dupes - e.g. "abc,abc" -> "abc"
join( ",", grep !$seen{$_}++, #$_ )
} #{ $stuff{$id} }{#header}
),
"\n";
}
__DATA__
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
This is perhaps a bit overkill for your application, but it should handle arbitrary column headings and arbitary numbers of duplicates. I'll coalesce them though - so the two abc entries don't end up abc,abc.
Output is:
id|Name|app1|app2
1|abc|234,265|231,321
2|xyz|123|215
3|asd|213|235
Another way of doing it which doesn't use a hash (in case you want to be more memory efficient), my contribution lies under the opens:
#!/usr/bin/perl
use strict;
use warnings;
my $basedir = 'E:\Perl\Input\\';
my $file ='doctor.txt';
open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
select(OUTFILE);
open(FH, '<', join('', $basedir, $file)) || die $!;
print(scalar(<FH>));
my #lastobj = (undef);
foreach my $obj (sort {$a->[0] <=> $b->[0]}
map {chomp;[split('|')]} <FH>) {
if(defined($lastobj[0]) &&
$obj[0] eq $lastobj[0])
{#lastobj = (#obj[0..1],
$lastobj[2].','.$obj[2],
$lastobj[3].','.$obj[3])}
else
{
if($lastobj[0] ne '')
{print(join('|',#lastobj),"|\n")}
#lastobj = #obj[0..3];
}
}
print(join('|',#lastobj),"|\n");
Note that split, without it's third argument ignores empty elements, which is why you have to add the last bar. If you don't do a chomp, you won't need to supply the bar or the trailing hard return, but you would have to record $obj[4].

Perl - need to store the column values into hash

I want to create a hash with column header as hash key and column values as hash values in Perl.
For example, if my csv file has the following data:
A,B,C,D,E
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15 ...
I want to create a hash as follows:
A=> 1,6,11
B=>2,7,12
c=>3,8,13 ...
So that just by using the header name I can use that column values.
Is there a way in PERL to do this? Please help me.
I was able to store required column values as array using the following script
use strict;
use warnings;
open( IN, "sample.csv" ) or die("Unable to open file");
my $wanted_column = "A";
my #cells;
my #colvalues;
my $header = <IN>;
my #column_names = split( ",", $header );
my $extract_col = 0;
for my $header_line (#column_names) {
last if $header_line =~ m/$wanted_column/;
$extract_col++;
}
while ( my $row = <IN> ) {
last unless $row =~ /\S/;
chomp $row;
#cells = split( ",", $row );
push( #colvalues, $cells[$extract_col] );
}
my $sizeofarray = scalar #colvalues;
print "Size of the coulmn= $sizeofarray";
But I want to do this to all my column.I guess Hash of arrays will be the best solution but I dont know how to implement it.
Text::CSV is a useful helper module for this sort of thing.
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;
my %combined;
open( my $input, "<", "sample.csv" ) or die("Unable to open file");
my $csv = Text::CSV->new( { binary => 1 } );
my #headers = #{ $csv->getline($input) };
while ( my $row = $csv->getline($input) ) {
for my $header (#headers) {
push( #{ $combined{$header} }, shift(#$row) );
}
}
print Dumper \%combined;
Since you requested without a module - you can use split but you need to bear in mind the limitations. CSV format allows for things like commas nested in quotes. split won't handle that case very well.
use strict;
use warnings;
use Data::Dumper;
my %combined;
open( my $input, "<", "sample.csv" ) or die("Unable to open file");
my $line = <$input>;
chomp ( $line );
my #headers = split( ',', $line );
while (<$input>) {
chomp;
my #row = split(',');
for my $header (#headers) {
push( #{ $combined{$header} }, shift(#row) );
}
}
print Dumper \%combined;
Note: Both of these will effectively ignore any extra columns that don't have headers. (And will get confused by duplicate column names).
Another solution by using for loop :
use strict;
use warnings;
my %data;
my #columns;
open (my $fh, "<", "file.csv") or die "Can't open the file : ";
while (<$fh>)
{
chomp;
my #list=split(',', $_);
for (my $i=0; $i<=$#list; $i++)
{
if ($.==1) # collect the columns, if its first line.
{
$columns[$i]=$list[$i];
}
else #collect the data, if its not the first line.
{
push #{$data{$columns[$i]}}, $list[$i];
}
}
}
foreach (#columns)
{
local $"="\,";
print "$_=>#{$data{$_}}\n";
}
Output will be like this :
A=>1,6,11
B=>2,7,12
C=>3,8,13
D=>4,9,14
E=>5,10,15

Adding a new element to an array in a hash of arrays

I want to read and save the content of a file in a hash of array. The first column of each row would be the key.
Then, I want to read the files in a directory and add the file name to the end of the array according to the key!
file ($file_info)
AANB John male
S00V Sara female
SBBA Anna female
files in the directory:
AANB.txt
SBBA.txt
S00V.txt
expected output:
AANB John male AANB.txt
S00V Sara female S00V.txt
SBBA Anna female SBBA.txt
Here's the script itself:
#!/usr/bin/perl
use strict;
use warnings;
my %all_samples=();
my $file_info = $ARGV[0];
open(FH, "<$file_info");
while(<FH>) {
chomp;
my #line = split("\t| ", $_);
push(#{$all_samples{$line[0]}}, $_);
}
my $dir = ".";
opendir(DIR, $dir);
my #files = grep(/\.txt$/,readdir(DIR));
closedir(DIR);
foreach my $file (#files) {
foreach my $k (keys %all_samples){
foreach my $element (#{ $all_samples{$k} }){
my #element = split(' ', $element);
if ($file =~ m/$element[0]/) {
push #{$all_samples{$element}}, $file;
}
else {
next;
}
}
}
}
foreach my $k (keys %all_samples) {
foreach my $element (#{ $all_samples{$k} }) {
print $element,"\n";
}
}
But the output is not what I expected
AANB John male
SBBA.txt1
S00V Sara female
SBBA Anna female
S00V.txt1
AANB.txt1
I think that
my #element = split(' ', $element);
if ($file =~ m/$element[0]/) {
push #{$all_samples{$element}}, $file;
}
Is not doing the right thing, so $all_samples{$element}} is a new arrayref. You're printing six one element arrays rather than three two element arrays.
But then it doesn't help that you're printing the array elements one per line.
I think that your final section should look more like this:
foreach my $k (keys %all_samples) {
print join( "\t", #{ $all_samples{$k} } ) . "\n"
}
In general, I think that you're overcomplicating this script. Here's how I would write it:
#!/usr/bin/perl
use strict;
use warnings;
my $all_samples={};
while(<>) {
chomp;
# Note that I'm using variable names here to document
# The format of the file being read. This makes for
# easier trouble-shooting -- if a column is missing,
# It's easier to tell that $file_base_name shouldn't be
# 'Anna' than that $line[0] should not be 'Anna'.
my ( $file_base_name, $given_name, $sex ) = split("\t", $_);
push(#{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
}
my $dir = ".";
opendir(DIR, $dir);
my #files = grep(/\.txt$/,readdir(DIR));
closedir(DIR);
FILE: foreach my $file (#files) {
BASE: foreach my $base (keys %{$all_samples}){
next BASE unless( $file =~ /$base/ );
push #{$all_samples->{$base}}, $file;
}
}
foreach my $k (keys %{$all_samples} ) {
print join( "\t", #{ $all_samples->{$k} } ) . "\n";
}
I prefer hashrefs to hashes, simply because I tend to deal with nested data structures -- I'm simply more used to seeing $all_samples->{$k} than $all_samples{$k}... more importantly, I'm using the full power of the arrayref, meaning that I'm not having to re-split the array that's already been split once.
G. Cito brings up an interesting point: why did I use
push(#{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
Rather than
push(#{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );
There's nothing syntactically wrong with the latter, but it wasn't what I was trying to accomplish:
Let's look at what $all_samples->{$base} would look like after
push #{$all_samples->{$base}}, $file;
If the original push had been this:
push(#{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );
#{$all_samples->{$base}} would look like this:
(
[ $file_base_name, $given_name, $sex ],
$file
)
If instead, we use
push(#{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
#{$all_samples->{$base}} looks like this after push #{$all_samples->{$base}}, $file:
(
$file_base_name,
$given_name,
$sex,
$file
)
For instance:
(
"AANB",
"John",
"male",
"AANB.txt"
)
So when we print the array:
print join( "\t", #{ $all_samples->{$k} } ) . "\n";
Will print
AANB John male AANB.txt
Here is somewhat simpler way of creating the hash of arrays - reading from DATA here instead of a file only for convenience:
#!perl
use strict ;
use warnings ;
use Data::Dumper ;
my $samples ;
while (<DATA>){
chomp;
map { $samples->{$_->[0]} = [#$_[1..2]] } [ split/\s+/ ];
}
push #{$samples->{$_}} , $_.".txt" for keys %$samples ;
print Dumper \$samples ;
__DATA__
AANB John male
S00V Sara female
SBBA Anna female
Since the filenames are known, you can just construct them from strings. Or is that not possible ? Perhaps confirming they exist with a file test (see perldoc -f -X) before pushing onto the array would avoid creating bad data but still allow you to build the entries this way.

Printing all HTML Tables with certain string for multiple files (perl)

I am trying to print all the HTML tables containing the string "kcat" for each xml file in a directory but I am having some trouble. Note that each file in the directory (named kcat_tables) has at least one HTML table with kcat in it. I am running this program on an ubuntu virtual machine. Here is my code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Slurp;
use Path::Iterator::Rule;
use HTML::TableExtract;
use utf8::all;
my #papers_dir_path = qw(/home/bob/kinase/kcat_tables);
my $rule = Path::Iterator::Rule->new;
$rule->name("*.nxml");
$rule->skip_dirs(".");
my $xml;
my $it = $rule->iter(#papers_dir_path);
while ( my $file = $it->() ) {
$xml = read_file($file);
my $te = HTML::TableExtract->new();
$te->parse($xml);
foreach my $ts ( $te->tables ) {
if ( $ts =~ /kcat/i ) {
print "Table (", join( ',', $ts->coords ), "):\n";
foreach my $row ( $ts->rows ) {
print join( ',', #$row ), "\n";
}
}
}
}
Any ideas on how I should fix this? Thanks in advance! Also, I am fairly new to the PERL language so a simple, comprehensible answer would be very much appreciated.
You cannot apply a regex to an object, as you do in:
if ( $ts =~ /kcat/i ) {
I'd suggest, parsing the tables in 'tree' mode. For this, you'd have to install two additional perl modules: HTML::TreeBuilder and HTML::ElementTable. Enable it like this:
use HTML::TableExtract 'tree';
Here's the fixed while loop:
while ( my $file = $it->() ) {
$xml = read_file($file);
my $te = HTML::TableExtract->new();
$te->parse($xml);
foreach my $ts ( $te->tables ) {
my $tree = $ts->tree or die $!;
if ( $tree->as_text =~ /kcat/i ) {
print "Table (", join( ',', $ts->coords ), "):\n";
# update 18.2.2015: pretty print the table
foreach my $row ($ts->rows) {
print join ' | ', map {sprintf "%22s", $_->as_text} #{$row};
print "\n";
# which is the same as
# foreach my $cell (#${$row}) { do something with $cell->as_text }
}
}
}
}
$tree is an HTML::ElementTable object. The code above works with your sample.

Retrieving values matching the same ID with perl

This is a simple problem but cannot find any working solution for it. I have 2 files and the first file holds all the ID that I am interested in, for example "tomato", "cucumber", but also the ones I am not interested in, which hold no value in the second file. The second file has the following data structure
tomato red
tomato round
tomato sweet
cucumber green
cucumber bitter
cucumber watery
What I need to get is a file containing all the IDs with all the matching values from the second file, everything tab-seperated, like this:
tomato red round sweet
cucumber green bitter watery
What I did so far is create a hash out of the IDs in the first file:
while (<FILE>) {
chomp;
#records = split "\t", $_;
{%hash = map { $records[0] => 1 } #records};
}
And this for the second file:
while (<FILE2>) {
chomp;
#records2 = split "\t", $_;
$key, $value = $records2[0], $records2[1];
$data{$key} = join("\t", $value);
}
close FILE;
foreach my $key ( keys %data )
{
print OUT "$key\t$data{$key}\n"
if exists $hash{$key}
}
Would be grateful for some simple solution for combining all the values matching the same ID! :)
for th first file:
while (<FILE>) {
chomp;
#records = split "\t", $_;
$hash{$records[0]} = 1;
}
and for the second:
while (<FILE2>) {
chomp;
#records2 = split "\t", $_;
($key,$value) = #records2;
$data{$key} = [] unless exists $data{$key};
push #{$data{$key}}, $value;
}
close FILE;
foreach my $key ( keys %data ) {
print OUT $key."\t".join("\t", #{$data{$key}})."\n" if exists $hash{$key};
}
This seems to do what is needed
use strict;
use warnings;
my %data;
open my $fh, '<', 'file1.txt' or die $!;
while (<$fh>) {
$data{$1} = {} if /([^\t]+)/;
}
open $fh, '<', 'file2.txt' or die $!;
while (<$fh>) {
$data{$1}{$2}++ if /^(.+?)\t(.+?)$/ and exists $data{$1};
}
while ( my ($key, $values) = each %data) {
print join("\t", $key, keys %$values), "\n";
}
output
tomato sweet round red
cucumber green watery bitter
It's easier if you read the data mapping first.
Also, if you are using Perl, you should consider from the get-go leveraging one its main strengths - CPAN libraries. For example, the reading in of the file is as simple as read_file() from File::Slurp; instead of having to open/close the file yourself and then run a while(<>) loop.
use File::Slurp;
my %data;
my #data_lines = File::Slurp::read_file($filename2);
chomp(#data_lines);
foreach my $line (#data_lines) { # Improved version from CyberDem0n's answer
my ($key, $value) = split("\t", $line);
$data{$key} ||= []; # Make sure it's an array reference if first time
push #{ $data{$key} }, $value;
}
my #id_lines = File::Slurp::read_file($filename1);
chomp(#id_lines);
foreach my $id (#id_lines) {
print join("\t", ( $id, #{ $data{$id} } ) )."\n";
}
A slightly more hacky but a bit shorter code adds the ID to the list of values in the data hash from the get go:
my #data_lines = File::Slurp::read_file($filename2);
chomp(#data_lines);
foreach my $line (#data_lines) { # Improved version from CyberDem0n's answer
my ($key, $value) = split("\t", $line);
$data{$key} ||= [ $id ]; # Add the ID for printing
push #{ $data{$key} }, $value;
}
my #id_lines = File::Slurp::read_file($filename1);
chomp(#id_lines);
foreach my $id (#id_lines) {
print join("\t", #{ $data{$id} } ) ."\n"; # ID already in %data!
}