Printing all HTML tables containing a certain string for multiple files (Perl)

I am trying to print all the HTML tables containing the string "kcat" for each XML file in a directory, but I am having some trouble. Note that each file in the directory (named kcat_tables) has at least one HTML table with kcat in it. I am running this program on an Ubuntu virtual machine. Here is my code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Slurp;
use Path::Iterator::Rule;
use HTML::TableExtract;
use utf8::all;

my @papers_dir_path = qw(/home/bob/kinase/kcat_tables);

my $rule = Path::Iterator::Rule->new;
$rule->name("*.nxml");
$rule->skip_dirs(".");

my $xml;
my $it = $rule->iter(@papers_dir_path);

while ( my $file = $it->() ) {
    $xml = read_file($file);
    my $te = HTML::TableExtract->new();
    $te->parse($xml);
    foreach my $ts ( $te->tables ) {
        if ( $ts =~ /kcat/i ) {
            print "Table (", join( ',', $ts->coords ), "):\n";
            foreach my $row ( $ts->rows ) {
                print join( ',', @$row ), "\n";
            }
        }
    }
}
Any ideas on how I should fix this? Thanks in advance! Also, I am fairly new to the Perl language, so a simple, comprehensible answer would be very much appreciated.

You cannot apply a regex to an object, as you do in:
if ( $ts =~ /kcat/i ) {
I'd suggest parsing the tables in 'tree' mode. For this, you'd have to install two additional Perl modules: HTML::TreeBuilder and HTML::ElementTable. Enable it like this:
use HTML::TableExtract 'tree';
Here's the fixed while loop:
while ( my $file = $it->() ) {
    $xml = read_file($file);
    my $te = HTML::TableExtract->new();
    $te->parse($xml);
    foreach my $ts ( $te->tables ) {
        my $tree = $ts->tree or die $!;
        if ( $tree->as_text =~ /kcat/i ) {
            print "Table (", join( ',', $ts->coords ), "):\n";
            # update 18.2.2015: pretty print the table
            foreach my $row ( $ts->rows ) {
                print join ' | ', map { sprintf "%22s", $_->as_text } @{$row};
                print "\n";
                # which is the same as
                # foreach my $cell (@{$row}) { do something with $cell->as_text }
            }
        }
    }
}
$tree is an HTML::ElementTable object. The code above works with your sample.
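If you would rather not install the tree-mode prerequisites, a minimal alternative sketch (untested against your files, and using HTML::TableExtract in its default, non-tree mode, where cells come back as plain text and may be undef) is to scan the joined cell text of each table instead:
foreach my $ts ( $te->tables ) {
    # Flatten the defined cell text of this table into one string and check it.
    my $table_text = join ' ', map { grep { defined } @$_ } $ts->rows;
    next unless $table_text =~ /kcat/i;

    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ',', map { defined $_ ? $_ : '' } @$row ), "\n";
    }
}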

Related

How to concatenate values for duplicate hash keys in Perl?

I know it is not possible to have duplicate keys in a hash, but this is what my data looks like:
Key Value
SETUP_FACE_PROT great
SETUP_FACE_PROT great2
SETUP_FACE_PROT great3
SETUP_FACE_PROT great3
SETUP_ARM_PROT arm
SETUP_FOOT_PROT foot
SETUP_FOOT_PROT foot2
SETUP_HEAD_PROT goggle
I would like to concatenate values for repeated keys, separated by a * character. For example, this is what I want the output to look like:
SETUP_FACE_PROT='great*great2*great3',
SETUP_ARM_PROT='arm',
SETUP_FOOT_PROT='foot*foot2',
SETUP_HEAD_PROT='goggle'
This is how I've tried to solve the problem so far:
foreach my $key ( sort keys %stuff )
{
    print "$key=\'", join( "*", @{ $stuff{$key} } ), "\'\n";
}
But instead of printing the result, how can I store it in a variable so that I can pass it to another subroutine? I'm trying to create a new string that looks like this:
$newstring="
SETUP_FACE_PROT='great*great2*great3',
SETUP_ARM_PROT='arm',
SETUP_FOOT_PROT='foot*foot2',
SETUP_HEAD_PROT='goggle' "
You can't have duplicate keys, but you can create a hash of arrays.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my %stuff;
while (<DATA>) {
    my ( $key, $value ) = split;
    push( @{ $stuff{$key} }, $value );
}

print Dumper \%stuff;

foreach my $key ( sort keys %stuff ) {
    print "$key=\'", join( "*", @{ $stuff{$key} } ), "\'\n";
}
__DATA__
SETUP_FACE_PROT great
SETUP_FACE_PROT great2
SETUP_FACE_PROT great3
SETUP_FACE_PROT great3
SETUP_ARM_PROT arm
SETUP_FOOT_PROT foot
SETUP_FOOT_PROT foot2
SETUP_HEAD_PROT goggle
Edit:
Turning it into a string as requested:
my $results;
foreach my $key ( sort keys %stuff ) {
    $results .= "$key=\'" . join( "*", @{ $stuff{$key} } ) . "\'\n";
}
print $results;
Or perhaps still using print, but with a filehandle opened onto a scalar:
my $results;
open( my $output, '>', \$results );
foreach my $key ( sort keys %stuff ) {
    print {$output} "$key=\'", join( "*", @{ $stuff{$key} } ), "\'\n";
}
close($output);
print $results;
At last I got an answer by doing this.
use strict;
use warnings;
use Data::Dumper;
use Text::CSV;

my %stuff;
my $csv = Text::CSV->new( { binary => 1, eol => $/ } );
my $filenamex = 'duplicate2.csv';
my $checkstring = '';

open( my $datab, '<', $filenamex ) or die "Could not open '$filenamex' $!\n";

while ( my $linea = <$datab> )
{
    chomp $linea;
    my @fieldsx = split ",", $linea;
    my $key   = $fieldsx[0];
    my $value = $fieldsx[1];
    # print $key;
    push( @{ $stuff{$key} }, $value );
}

foreach my $key ( sort keys %stuff )
{
    $checkstring = $checkstring . ',' . $key . '=' . join( "*", @{ $stuff{$key} } );
}
print $checkstring;
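Note that the string built above starts with a stray leading comma. One way around that (a small sketch using the same %stuff hash, in place of the foreach loop) is to build the key=value pieces first and join them afterwards:
my $checkstring = join ',',
    map { $_ . '=' . join( '*', @{ $stuff{$_} } ) } sort keys %stuff;
print $checkstring;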

Perl: Need to append two columns if the IDs repeat

If an id is repeated, I want to append the app1 and app2 values and print the record only once.
Input:
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
Output:
id|Name|app1|app2
1|abc|234,265|231,321|
2|xyz|123|215|
3|asd|213|235|
Output I'm getting:
id|Name|app1|app2
1|abc|234,231|
2|xyz|123,215|
1|abc|265,321|
3|asd|213,235|
My Code:
#!/usr/bin/perl
use strict;
use warnings;

my $basedir = 'E:\Perl\Input\\';
my $file = 'doctor.txt';
my $counter = 0;
my %RepeatNumber;
my $pos = 0;

open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
open(FH, '<', join('', $basedir, $file)) || die $!;

my $line = readline(FH);
unless ($counter) {
    chomp $line;
    print OUTFILE $line;
    print OUTFILE "\n";
}

while ($line = readline(FH)) {
    chomp $line;
    my @obj = split('\|', $line);
    if ($RepeatNumber{$obj[0]}++) {
        my $str1 = join("|", $obj[0]);
        my $str2 = join(",", $obj[2], $obj[3]);
        print OUTFILE join("|", $str1, $str2);
        print OUTFILE "\n";
    }
}
This should do the trick:
use strict;
use warnings;

my $file_in = "doctor.txt";
open (FF, "<$file_in");
my $temp = <FF>; # remove first line

my %out;
while (<FF>)
{
    my ($id, $Name, $app1, $app2) = split /\|/, $_;
    $out{$id}[0] = $Name;
    push @{$out{$id}[1]}, $app1;
    push @{$out{$id}[2]}, $app2;
}

foreach my $key (keys %out)
{
    print $key, "|", $out{$key}[0], "|", join (",", @{$out{$key}[1]}), "|", join (",", @{$out{$key}[2]}), "\n";
}
EDIT
To see what the %out contains (in case it's not clear), you can use
use Data::Dumper;
and print it via
print Dumper(\%out);
Passing a reference keeps the nested structure together instead of flattening the hash into a list of keys and values.
I'd tackle it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use 5.14.0;

my %stuff;

#extract the header row.
#use the regex to remove the linefeed, because
#we can't chomp it inline like this.
#works since perl 5.14
#otherwise we could just chomp (@header) later.
my ( $id, @header ) = split( /\|/, <DATA> =~ s/\n//r );

while (<DATA>) {
    #turn this row into a hash of key-values.
    my %row;
    ( $id, @row{@header} ) = split(/\|/);

    #print for diag
    print Dumper \%row;

    #iterate each key, and insert into %stuff.
    foreach my $key ( keys %row ) {
        push( @{ $stuff{$id}{$key} }, $row{$key} );
    }
}

#print for diag
print Dumper \%stuff;

print join( "|", "id", @header ), "\n";

#iterate ids in the hash
foreach my $id ( sort keys %stuff ) {
    #join this record by '|'.
    print join( '|',
        $id,
        #turn inner arrays into comma separated via map.
        map {
            my %seen;
            #use grep to remove dupes - e.g. "abc,abc" -> "abc"
            join( ",", grep !$seen{$_}++, @$_ )
        } @{ $stuff{$id} }{@header}
        ),
        "\n";
}
__DATA__
id|Name|app1|app2
1|abc|234|231|
2|xyz|123|215|
1|abc|265|321|
3|asd|213|235|
This is perhaps a bit overkill for your application, but it should handle arbitrary column headings and arbitrary numbers of duplicates. I coalesce repeated values, though, so the two abc entries don't end up as abc,abc.
Output is:
id|Name|app1|app2
1|abc|234,265|231,321
2|xyz|123|215
3|asd|213|235
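The grep !$seen{$_}++ construct used above to coalesce the duplicate abc entries is a general first-occurrence de-duplication idiom; a tiny standalone example:
my @raw = qw(abc abc 234 265 234);
my %seen;
my @uniq = grep { !$seen{$_}++ } @raw;   # ('abc', '234', '265')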
Another way of doing it that doesn't use a hash (in case you want to be more memory efficient); my contribution starts below the open calls:
#!/usr/bin/perl
use strict;
use warnings;

my $basedir = 'E:\Perl\Input\\';
my $file = 'doctor.txt';

open(OUTFILE, '>', 'E:\Perl\Output\DoctorOpFile.csv') || die $!;
select(OUTFILE);
open(FH, '<', join('', $basedir, $file)) || die $!;
print(scalar(<FH>));

my @lastobj = (undef);
foreach my $obj (sort { $a->[0] <=> $b->[0] }
                 map  { chomp; [ split('\|') ] } <FH>) {
    if ( defined($lastobj[0]) && $obj->[0] eq $lastobj[0] ) {
        @lastobj = ( @{$obj}[0..1],
                     $lastobj[2] . ',' . $obj->[2],
                     $lastobj[3] . ',' . $obj->[3] );
    }
    else {
        print( join('|', @lastobj), "|\n" ) if defined $lastobj[0];
        @lastobj = @{$obj}[0..3];
    }
}
print(join('|', @lastobj), "|\n");
Note that split, without a negative limit as its third argument, drops trailing empty fields, which is why you have to add the last bar back when printing. If you don't chomp, you won't need to supply the bar or the trailing newline, but you would then have to carry $obj->[4] around as well.
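As a quick standalone illustration of that split behaviour (trailing empty fields are dropped unless you pass a negative limit):
my $line = '1|abc|234|231|';
my @four = split /\|/, $line;        # ('1', 'abc', '234', '231') - trailing empty field dropped
my @five = split /\|/, $line, -1;    # ('1', 'abc', '234', '231', '') - kept with a negative limit
print scalar(@four), " vs ", scalar(@five), "\n";   # prints "4 vs 5"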

Perl - need to store the column values in a hash

I want to create a hash with column header as hash key and column values as hash values in Perl.
For example, if my csv file has the following data:
A,B,C,D,E
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15 ...
I want to create a hash as follows:
A=> 1,6,11
B=>2,7,12
C=>3,8,13 ...
That way, just by using the header name, I can access that column's values.
Is there a way in Perl to do this? Please help me.
I was able to store the required column's values in an array using the following script:
use strict;
use warnings;

open( IN, "sample.csv" ) or die("Unable to open file");

my $wanted_column = "A";
my @cells;
my @colvalues;

my $header = <IN>;
my @column_names = split( ",", $header );

my $extract_col = 0;
for my $header_line (@column_names) {
    last if $header_line =~ m/$wanted_column/;
    $extract_col++;
}

while ( my $row = <IN> ) {
    last unless $row =~ /\S/;
    chomp $row;
    @cells = split( ",", $row );
    push( @colvalues, $cells[$extract_col] );
}

my $sizeofarray = scalar @colvalues;
print "Size of the column = $sizeofarray";
But I want to do this for all my columns. I guess a hash of arrays will be the best solution, but I don't know how to implement it.
Text::CSV is a useful helper module for this sort of thing.
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;

my %combined;

open( my $input, "<", "sample.csv" ) or die("Unable to open file");
my $csv = Text::CSV->new( { binary => 1 } );

my @headers = @{ $csv->getline($input) };

while ( my $row = $csv->getline($input) ) {
    for my $header (@headers) {
        push( @{ $combined{$header} }, shift(@$row) );
    }
}

print Dumper \%combined;
Since you requested a version without a module: you can use split, but you need to bear in mind its limitations. The CSV format allows things like commas nested inside quotes, and split won't handle that case well (there is a short illustration of this after the note below).
use strict;
use warnings;
use Data::Dumper;

my %combined;

open( my $input, "<", "sample.csv" ) or die("Unable to open file");

my $line = <$input>;
chomp($line);
my @headers = split( ',', $line );

while (<$input>) {
    chomp;
    my @row = split(',');
    for my $header (@headers) {
        push( @{ $combined{$header} }, shift(@row) );
    }
}

print Dumper \%combined;
Note: Both of these will effectively ignore any extra columns that don't have headers. (And will get confused by duplicate column names).
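To make the quoted-comma caveat concrete, here is a minimal illustration (with made-up data) of where a plain split tears a field apart while Text::CSV keeps it intact:
use strict;
use warnings;
use Text::CSV;

my $line  = 'A,"B, with comma",C';
my @naive = split /,/, $line;                  # 4 pieces - the quoted field is torn apart
my $csv   = Text::CSV->new( { binary => 1 } );
$csv->parse($line);
my @fields = $csv->fields;                     # 3 fields - the embedded comma survives
print scalar(@naive), " vs ", scalar(@fields), "\n";   # prints "4 vs 3"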
Another solution, using a for loop:
use strict;
use warnings;

my %data;
my @columns;

open (my $fh, "<", "file.csv") or die "Can't open the file : ";
while (<$fh>)
{
    chomp;
    my @list = split(',', $_);
    for (my $i = 0; $i <= $#list; $i++)
    {
        if ($. == 1)    # collect the columns, if it's the first line.
        {
            $columns[$i] = $list[$i];
        }
        else            # collect the data, if it's not the first line.
        {
            push @{$data{$columns[$i]}}, $list[$i];
        }
    }
}

foreach (@columns)
{
    local $" = ",";
    print "$_=>@{$data{$_}}\n";
}
Output will be like this :
A=>1,6,11
B=>2,7,12
C=>3,8,13
D=>4,9,14
E=>5,10,15

Output .resx from .cs using a Perl script

The .cs file contains strings within double quotes, and I am trying to extract these strings into a .resx file.
The existing code outputs the .resx, but with only one string, whereas the .cs file contains more than one quoted string.
Can you please provide any reference to achieve this?
use strict;
use warnings;
use File::Find;
use XML::Writer;
use IO::File;
use Cwd;

#user input: [Directory]
my $wrkdir = getcwd;
system "attrib -r /s";
print "Processing $wrkdir\n";
find( \&recurse_src_path, $wrkdir );

sub recurse_src_path
{
    my $file  = $File::Find::name;
    my $fname = $_;
    my @lines;
    my $line;
    if ( ( -f $file ) && ( $file =~ /.*\.cs$/i ) )
    {
        print "..";
        open( FILE, $file ) || die "Cannot open $file:\n$!";
        while ( $line = <FILE> )
        {
            if ( $line =~ s/\"(.*?)\"/$1/m )
            {
                chomp $line;
                push( @lines, $line );
                my $nl = '0';
                my $dataIndent;
                my $output = new IO::File(">Test.resx");
                #binmode( $output, ":encoding(utf-8)" );
                my $writer = XML::Writer->new(
                    OUTPUT      => $output,
                    DATA_MODE   => 1,
                    DATA_INDENT => 2
                );
                $writer->xmlDecl("utf-8");
                $writer->startTag('root');
                foreach my $r ($line)
                {
                    print "$1\n";
                    $writer->startTag( 'data', name => $_ );
                    $writer->startTag('value');
                    $writer->characters($1);
                    $writer->endTag('value');
                    $writer->startTag('comment');
                    $writer->characters($1);
                    $writer->endTag('comment');
                    $writer->endTag('data');
                }
                $writer->endTag('root');
                $writer->end;
                $output->close();
            }
        }
        close FILE;
    }
}
Use the /g regex modifier. For example:
use strict;
use warnings;

my $cs_string = '
// Imagine this is .cs code here
system "attrib -r /s";
print "Processing $wrkdir\n";
find( \&recurse_src_path, $wrkdir );
';

while ( $cs_string =~ /\"(.*?)\"/g ) {
    print "Found quoted string: '$1'\n";
}
See also: http://perldoc.perl.org/perlrequick.html#Matching-repetitions
You might also want to look at File::Slurp to read your .cs code into a single Perl scalar, trusting that your .cs file is not too large.
Finally, combine this with your existing code to get the .resx output; a rough sketch of how the pieces fit together follows.
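The sketch below is untested; the input file name Example.cs and the data-element naming scheme (String1, String2, ...) are assumptions, so adjust them to whatever your .resx consumer expects:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw(read_file);
use IO::File;
use XML::Writer;

# Slurp the whole .cs file into one scalar, then pull out every quoted string.
my $cs      = read_file('Example.cs');          # assumed input file name
my @strings = $cs =~ /"(.*?)"/g;

my $output = IO::File->new('> Test.resx') or die "Cannot write Test.resx: $!";
my $writer = XML::Writer->new(
    OUTPUT      => $output,
    DATA_MODE   => 1,
    DATA_INDENT => 2,
);

$writer->xmlDecl('utf-8');
$writer->startTag('root');
my $n = 0;
for my $s (@strings) {
    $writer->startTag( 'data', name => 'String' . ++$n );   # assumed naming scheme
    $writer->startTag('value');
    $writer->characters($s);
    $writer->endTag('value');
    $writer->endTag('data');
}
$writer->endTag('root');
$writer->end;
$output->close;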

Renaming files using a hash table in Perl

I have written some Perl code, shown below. What I am trying to do is first get input from a text file consisting of HTTP URLs, each with a title.
The first regex captures the title and the second regex fetches the ID from inside the URL.
These values are inserted into the hash %myfilenames.
So this hash has the URL ID as key and the title as value. Everything up to here works fine. Now I have a set of files on my computer whose names contain the ID that was extracted from the URL.
What I want to do is: if the ID is in the hash table, the file should be renamed to the value assigned to that ID. The output at the print statement in the last loop is correct, but I am unable to rename the files. I tried many things, but nothing works. Can someone help, please?
example stuff:
URL: https://abc.com/789012 <--- ID
Value (new Title) : ABC
file name on computer = file-789012 <---- ID
new file name = ABC
My code:
use File::Slurp;
use File::Copy qw(move);

open( F, '<hadoop.txt' );

$key   = '';
$value = '';
%myfilenames = ();

foreach (<F>) {
    if ( $_ =~ /Lecture/ ) {
        $value = $_;
    }
    if ( $_ =~ /https/ ) {
        if ( $_ =~ /\d{6}/ ) {
            $key = $&;
        }
    }
    if ( !( $value eq '' || $key eq '' ) ) {
        #print "$key\t\t$value";
        $myfilenames{$key} = $value;
        $key   = '';
        $value = '';
    }
}

#while ( my ( $k, $v ) = each %myfilenames ) { print "$k $v\n"; }

my @files = read_dir 'C:\\inputfolder';

for (@files) {
    if ( $_ =~ /\d{6}/ ) {
        $oldval = $&;
    }
    $newval = $myfilenames{$oldval};
    chomp($newval);
    print $_ , "\t\t$newval" . "\n";
    $key = '';
}
You probably didn't prepend the path to the file names. The following works for me (on a Linux box):
#!/usr/bin/perl
use warnings;
use strict;

use File::Slurp qw{ read_dir };

my $dir = 0;
mkdir $dir;
open my $FH, '>', "$dir/$_" for 123456, 234567;

my $key = my $value = q();
my %myfilenames = ();
for (<DATA>) {
    chomp;
    $value = $_ if /Lecture/;
    $key   = $1 if /https/ and /(\d{6})/;
    if ( $value ne q() and $key ne q() ) {
        $myfilenames{$key} = $value;
        $key = $value = q();
    }
}

my @files = read_dir($dir);
for (@files) {
    if (/(\d{6})/) {
        my $oldval = $1;
        my $newval = $myfilenames{$oldval};
        rename "$dir/$oldval", "$dir/$newval";
    }
}
__DATA__
Lecture A1
https://123456
# Comment
Lecture A2
https://234567
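Applied back to the original file-<ID> naming scheme, the rename step might look roughly like this (a sketch only; the C:\inputfolder path comes from the question, %myfilenames and read_dir from the code above, and the chomp is needed only if the titles were stored with their trailing newline as in the original code):
# Sketch of the rename step for files named like "file-789012".
my $dir = 'C:\\inputfolder';                 # assumed directory from the question
for my $old ( read_dir($dir) ) {
    next unless $old =~ /(\d{6})/;           # pull the six-digit ID out of the name
    my $title = $myfilenames{$1} or next;    # skip files whose ID is not in the hash
    chomp $title;
    rename "$dir\\$old", "$dir\\$title"
        or warn "Could not rename $old: $!";
}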