Merging multiple text files based on one column in perl


I am new to Perl, and this is my first question here.
I have 10-18 text files in a folder. I want to read all of them and merge the rows whose value in the Name column appears in every file, collecting the Area column from each file.
For example:
file 1.txt
Name sim Area Cas
aa 12 54 222
ab 23 2 343
aaa 32 34 34
bba 54 76 65
file 2.txt
Name Sim Area Cas
ab 45 45 56
abc 76 87 98
bba 54 87 87
aaa 33 43 54
file 3.txt
Name Sim Area Cas
aaa 43 54 65
ab 544 76 87
ac 54 65 76
Output should be
Name Area1 Area2 area3
aaa 32 43 54
ab 23 45 76
Can anyone help with this? I am very new to Perl and struggling to use hashes.
I have tried this so far:
use strict;
use warnings;
my $input_dir = 'C:/Users/Desktop/mr/';
my $output_dir = 'C:/Users/Desktop/test_output/';
opendir SD, $input_dir || die 'cannot open the input directory $!';
my @files_list = readdir(SD);
closedir(SD);
foreach my $each_file(@files_list)
{
if ($each_file!~/^\./)
{
#print "$each_file\n"; exit;
open (IN, $input_dir.$each_file) || die 'cannot open the inputfile $!';
open (OUT, ">$output_dir$each_file") || die 'cannot open the outputfile $!';
print OUT "Name\tArea\n";
my %hash; my %area; my %remaning_data;
while(my $line=<IN>){
chomp $line;
my @line_split=split(/\t/,$line);
# print $_,"\n" foreach(@line_split);
my $name=$line_split[0];
my $area=$line_split[1];
}
}
}
Can anyone provide guidance on how to complete this?
Thanks in advance.

perl -lane '$X{$F[0]}.=" $F[2]";END{foreach(keys %X){if(scalar(split / /,$X{$_})==4){print $_,$X{$_}}}}' file1 file2 file3
tested:
> perl -lane '$X{$F[0]}.=" $F[2]";END{foreach(keys %X){if(scalar(split / /,$X{$_})==4){print $_,$X{$_}}}}' file1 file2 file3
ab 2 45 76
aaa 34 43 54

#!/usr/bin/perl
use strict;
use warnings;
my $inputDir = '/tmp/input';
my $outputDir = '/tmp/out';
opendir(my $readdir, $inputDir) or die "cannot open the input directory: $!";
my @files_list = readdir($readdir);
closedir($readdir);
my %areas;
foreach my $file (@files_list) {
    next if $file =~ /^\.+$/; # skip . and ..
    open(my $fh, '<', "$inputDir/$file") or die "cannot open $inputDir/$file: $!";
    while (my $s = <$fh>) {
        if ($s =~ /(\w+)\s+[\d\.]+\s+([\d\.]+)/) {
            my ($name,$area) = ($1, $2); # parse name and area
            push(@{$areas{$name}}, $area); # add area to the hash of arrays
        }
    }
    close ($fh);
}
open(my $out, '>', "$outputDir/outfile") or die "cannot open $outputDir/outfile: $!";
foreach my $key (keys %areas) {
    print $out "$key ";
    print $out join " ", @{$areas{$key}};
    print $out "\n";
}
close ($out);
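The question asks for only the names that occur in every file, plus a header row. Here is a minimal sketch of that filtering step. It assumes whitespace-separated columns in the order Name, Sim, Area, Cas, one header line per file, and file names passed on the command line. The helper name merge_common_areas is hypothetical and not part of either answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (\@common_names, \%areas), where
# $areas{$name}[$i] is the Area value of $name in the i-th file.
sub merge_common_areas {
    my @files = @_;
    my (%areas, @order);
    for my $i (0 .. $#files) {
        open(my $fh, '<', $files[$i]) or die "cannot open $files[$i]: $!";
        my $header = <$fh>;                  # skip the header line
        while (my $line = <$fh>) {
            my ($name, $sim, $area) = split ' ', $line;
            next unless defined $area;       # skip blank or short lines
            push @order, $name unless exists $areas{$name};
            $areas{$name}[$i] = $area;       # slot $i belongs to file $i
        }
        close($fh);
    }
    # keep only the names that have an Area value in every file
    my @common;
    for my $name (@order) {
        my $found = grep { defined $areas{$name}[$_] } 0 .. $#files;
        push @common, $name if $found == @files;
    }
    return (\@common, \%areas);
}

if (@ARGV) {
    my ($common, $areas) = merge_common_areas(@ARGV);
    print join(' ', 'Name', map { "Area$_" } 1 .. @ARGV), "\n";
    print join(' ', $_, @{ $areas->{$_} }), "\n" for @$common;
}
```

Run against the three sample files from the question, this should list ab and aaa with their three Area values, the same rows the one-liner reports. Names are printed in the order they are first seen rather than in hash order.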

Related

Parsing a file by summing up different columns of each row separated by blank line

I have an input file as below:
#
volume stats
start_time 1
length 2
--------
ID
0x00a,1,2,3,4
0x00b,11,12,13,14
0x00c,21,22,23,24

volume stats
start_time 2
length 2
--------
ID
0x00a,31,32,33,34
0x00b,41,42,43,44
0x00c,51,52,53,54

volume stats
start_time 3
length 2
--------
ID
0x00a,61,62,63,64
0x00b,71,72,73,74
0x00c,81,82,83,84
#
I need the output in the format below:
1 33 36 39 42
2 123 126 129 132
3 213 216 219 222
#
Below is my code:
#!/usr/bin/perl
use strict;
use warnings;
#use File::Find;
# Define file names and its location
my $input = $ARGV[0];
# Grab the vols stats for different intervals
open (INFILE,"$input") or die "Could not open sample.txt: $!";
my $date_time;
my $length;
my $col_1;
my $col_2;
my $col_3;
my $col_4;
foreach my $line (<INFILE>)
{
if ($line =~ m/start/)
{
my @date_fields = split(/ /,$line);
$date_time = $date_fields[1];
}
if ($line =~ m/length/i)
{
my @length_fields = split(/ /,$line);
$length = $length_fields[1];
}
if ($line =~ m/0[xX][0-9a-fA-F]+/)
{
my @volume_fields = split(/,/,$line);
$col_1 += $volume_fields[1];
$col_2 += $volume_fields[2];
$col_3 += $volume_fields[3];
$col_4 += $volume_fields[4];
#print "$col_1\n";
}
if ($line =~ /^$/)
{
print "$date_time $col_1 $col_2 $col_3 $col_4\n";
$col_1=0;$col_2=0;$col_3=0;$col_4=0;
}
}
close (INFILE);
#
My code's result is:
1
33 36 39 42
2
123 126 129 132
#
Basically, for each time interval, it should sum each column over all the lines and print the column totals next to that interval.

$/ is your friend here. Try setting it to '' to enable paragraph mode (separating your data by blank lines).
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = '';    # paragraph mode: records are separated by blank lines
while ( <> ) {
    my ( $start ) = m/start_time\s+(\d+)/;
    my ( $length ) = m/length\s+(\d+)/;
    my @row_sum;
    for ( m/(0x.*)/g ) {
        my ( $key, @values ) = split /,/;
        for my $index ( 0..$#values ) {
            $row_sum[$index] += $values[$index];
        }
    }
    print join ( "\t", $start, @row_sum ), "\n";
}
Output:
1 33 36 39 42
2 123 126 129 132
3 213 216 219 222
NB - using tab stops for output. Can use sprintf if you need more flexible options.
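As a sketch of the sprintf option: the snippet below right-aligns each field in a fixed-width column instead of relying on tab stops. The helper name format_row and the width of 5 are assumptions chosen for illustration. Pick a width that fits your largest value.

```perl
use strict;
use warnings;

# Hypothetical helper: right-align every field in a column of $width characters.
sub format_row {
    my ($width, @fields) = @_;
    return join(' ', map { sprintf "%${width}s", $_ } @fields);
}

# Example with the first record's start_time and column sums.
print format_row(5, 1, 33, 36, 39, 42), "\n";
```

In the paragraph-mode loop above, the print line could then become print format_row(5, $start, @row_sum), "\n";.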
I would also suggest that instead of:
my $input = $ARGV[0];
open (my $input_fh, '<', $input) or die "Could not open $input: $!";
You would be better off with:
while ( <> ) {
That is because <> is Perl's magic filehandle: it opens the files named on the command line and reads them one at a time, and if none are given, it reads STDIN. This is just how grep, sed, and awk behave.
So you can still run this with scriptname.pl sample.txt or you can do curl http://somewebserver/sample.txt | scriptname.pl or scriptname.pl sample.txt anothersample.txt moresample.txt
Also - if you want to open the file yourself, you're better off using lexical vars and 3 arg open:
open ( my $input_fh, '<', $ARGV[0] ) or die $!;
And you really shouldn't ever be using 'numbered' variables like $col_1 etc. If there are numbers in the names, then an array is almost always better.

Basically, a block begins with start_time and ends with a line of whitespace. If the end of a block is instead always guaranteed to be an empty line, you can change the test below.
It helps to use arrays instead of variables with integer suffixes.
When you hit the start of a new block, record the start_time value. When you hit a stat line, update the column sums, and when you hit a line of whitespace, print the column sums and clear them.
This way, you keep your program's memory footprint proportional to the longest line of input as opposed to the largest block of input. In this case, there isn't a huge difference, but in real life there can be. Your original program was reading the entire file into memory as a list of lines, which would really cause its memory footprint to balloon with large inputs.
#!/usr/bin/env perl
use strict;
use warnings;
my $start_time;
my @cols;
while (my $line = <DATA>) {
    if ( $line =~ /^start_time \s+ ([0-9]+)/x) {
        $start_time = $1;
    }
    elsif ( $line =~ /^0x/ ) {
        my ($id, @vals) = split /,/, $line;
        for my $i (0 .. $#vals) {
            $cols[ $i ] += $vals[ $i ];
        }
    }
    elsif ( !($line =~ /\S/) ) {
        # guard against the possibility of
        # multiple blank/whitespace lines between records
        if ( @cols ) {
            print join("\t", $start_time, @cols), "\n";
            @cols = ();
        }
    }
}
# in case there is no blank/whitespace line after last record
if ( @cols ) {
    print join("\t", $start_time, @cols), "\n";
}
__DATA__
volume stats
start_time 1
length 2
--------
ID
0x00a,1,2,3,4
0x00b,11,12,13,14
0x00c,21,22,23,24

volume stats
start_time 2
length 2
--------
ID
0x00a,31,32,33,34
0x00b,41,42,43,44
0x00c,51,52,53,54

volume stats
start_time 3
length 2
--------
ID
0x00a,61,62,63,64
0x00b,71,72,73,74
0x00c,81,82,83,84
Output:
1 33 36 39 42
2 123 126 129 132
3 213 216 219 222

When I run your code, I get warnings:
Use of uninitialized value $date_time in concatenation (.) or string
I fixed it by using \s+ instead of / /.
I also added a print after your loop in case the file does not end with a blank line.
Here is minimally-changed code to produce your desired output:
use strict;
use warnings;
# Define file names and its location
my $input = $ARGV[0];
# Grab the vols stats for different intervals
open (INFILE,"$input") or die "Could not open sample.txt: $!";
my $date_time;
my $length;
my $col_1;
my $col_2;
my $col_3;
my $col_4;
foreach my $line (<INFILE>)
{
if ($line =~ m/start/)
{
my @date_fields = split(/\s+/,$line);
$date_time = $date_fields[1];
}
if ($line =~ m/length/i)
{
my @length_fields = split(/\s+/,$line);
$length = $length_fields[1];
}
if ($line =~ m/0[xX][0-9a-fA-F]+/)
{
my @volume_fields = split(/,/,$line);
$col_1 += $volume_fields[1];
$col_2 += $volume_fields[2];
$col_3 += $volume_fields[3];
$col_4 += $volume_fields[4];
}
if ($line =~ /^$/)
{
print "$date_time $col_1 $col_2 $col_3 $col_4\n";
$col_1=0;$col_2=0;$col_3=0;$col_4=0;
}
}
print "$date_time $col_1 $col_2 $col_3 $col_4\n";
close (INFILE);
__END__
1 33 36 39 42
2 123 126 129 132
3 213 216 219 222

How to read a file in Perl till it matches certain string/value

I have one text file which has entries like
123
123
234
456
789
654
123
123
123
I am trying to write a Perl script that opens the file and reads through it, ignoring the second 123 and stopping at the next 123 after that:
Desired output:
123 # Keep
123 # Ignore
234 # Keep
456 # Keep
789 # Keep
654 # Keep
123 # Keep and stop here

#!/usr/bin/env perl
use strict;
use warnings;
# open my $file, '<', 'in.txt' or die $!; # If you're reading in from a file use this
my %seen;
while (<DATA>) {
chomp;
$seen{$_}++;
next if $seen{$_} == 2;
print "$_\n";
last if $seen{$_} > 2;
}
__DATA__
123
123
234
456
789
654
123
123
123
---output---
123
234
456
789
654
123

How do print second column elements in row separated by comma(,) if the first element of column is same

The input I am handling is as follows.
Q9NRG9 15
Q9NRG9 160
Q9NRG9 56
Q9NRG9 89
Q16613 26
Q16613 63
Q16613 102
O95477 19
O95477 91
O95477 78
O95477 86
O95477 16
O95477 203
O95477 66
P78363 18
P78363 159
P78363 88
I want output as
Q9NRG9 15,160,56,89
Q16613 26,63,102
O95477 78,86,16,203,66
I tried writing a Perl program, but I couldn't get the output I want.

Using perl from the command line:
perl -lane '
  push @{ $h{$F[0]} }, $F[1]
}{
  $" = ",";
  print "$_ @{ $h{$_} }" for keys %h
' file
O95477 19,91,78,86,16,203,66
Q9NRG9 15,160,56,89
P78363 18,159,88
Q16613 26,63,102
To maintain the order, you can do:
perl -lane '
  $k{$F[0]}++ or push @r, $F[0];
  push @{ $h{$F[0]} }, $F[1]
}{
  $" = ",";
  print "$_ @{ $h{$_} }" for @r
' file

Try this:
open (FILE, "text.txt") or die "cannot open file: ".$!;
my %data;
while(<FILE>){
    chomp($_);
    my ($key, $value) = split(/\s+/,$_);
    push(@{$data{$key}}, $value);
}
foreach (keys %data){
    print $_." ".join(",",@{$data{$_}})."\n";
}

Perl Format the Output in nice way

I have written a perl code for processing file 'Output.txt' which has below Content.
new.example.com 28
new.example.com 28
example.com 28
example.com 29
example.com 29
example.com 29
example.com 29
orginal.com 28
orginal.com 29
orginal.com 30
orginal.com 31
expand.com 31
The file 'domain.txt' has a list of domain names which I need to match against the file 'Output.txt':
new.example.com
example.com
orginal.com
test.com
new.com
I managed to write Perl code like this:
#!/usr/bin/perl
use strict;
open(LOGFILE,"Output.txt") or die("Could not open log file.");
my $domain_name = 'domain.txt' ;
open(DOM, $domain_name);
my @r_contents = <LOGFILE>;
close(LOGFILE);
while(<DOM>) {
chomp;
my $line = $_;
my @lowercase = map { lc } @r_contents;
my @grepNames = grep /^$line/, @lowercase;
foreach (@grepNames) {
if ( grep /^$line/, @lowercase ) {
$domains{lc($_)}++ ; }
}
}
close(DOM) ;
foreach my $domain (sort keys %domains) {
my %seen ;
($Dname, $WeekNum) = split(/\s+/, $domain);
my @array1 = grep { ! $seen{ $_ }++ } $WeekNum;
push @array2, @array1;
my @array4 = "$domains{$domain} $domain" ;
push @matrix,@array4 ;
}
printf "%-10s %-25s %-25s\n", 'DoaminName', "Week $array2[0]" ,"Week $array2[1]","Week $array2[2]";
print " @matrix \n";
current Output looks like this.
DoaminName Week 28 week29 week30 week 31
2 new.example.com 35
1 example.com 28
4 example.com 29
1 orginal.com 28
1 orginal.com 29
1 orginal.com 30
1 orginal.com 31
But I am trying to rewrite the Perl code to print the output like this. Please help me correct the code.
Domain/WeekNumber Week28 Week29 Week30 Week31
new.example.com 2 No No No
example.com 1 4 NO NO
orginal.com 1 1 1 1

This produces the desired output, only I also sorted the output.
If you have long website names, add tabs in the code as necessary.
#!/usr/bin/perl
use warnings;
use strict;
open my $fh, "<", "Output.txt" or die "could not open Output.txt\n";
my %data;
my %weeks;
while(<$fh>){
chomp;
$_=~ m/(.*?)\s++(\d++)/;
my $site=$1;
my $week=$2;
$weeks{$week}++;
$data{$site}{$week}++;
}
print "Domain/WeekNumber";
for my $week (sort {$a <=> $b} keys %weeks){
print"\tWeek$week";
}
print"\n";
for my $site(sort keys %data){
print"$site\t\t";
for my $week (sort {$a <=> $b} keys %weeks){
unless(defined $data{$site}{$week} ){
print"NO\t";
}else{
print $data{$site}{$week} ."\t";
}
}
print"\n";
}
close $fh;
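For long site names, one alternative to adding tabs is to pad the first column to the longest name with sprintf. The sketch below assumes the same hash-of-hashes layout as %data above (site, then week, then count). The helper name render_table and the fixed week-column width of 6 are illustrative choices, not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: turn $data->{$site}{$week} = count into padded lines.
sub render_table {
    my ($data) = @_;
    my (%weeks, @lines);
    my $header = 'Domain/WeekNumber';
    my $width  = length($header);
    for my $site (keys %$data) {
        $weeks{$_} = 1 for keys %{ $data->{$site} };
        $width = length($site) if length($site) > $width;
    }
    my @weeks = sort { $a <=> $b } keys %weeks;
    # "%-*s" left-justifies its argument in a column of the given width
    my $line = sprintf('%-*s', $width, $header);
    $line .= sprintf(' %-6s', "Week$_") for @weeks;
    push @lines, $line;
    for my $site (sort keys %$data) {
        $line = sprintf('%-*s', $width, $site);
        for my $week (@weeks) {
            my $count = $data->{$site}{$week};
            $line .= sprintf(' %-6s', defined $count ? $count : 'NO');
        }
        push @lines, $line;
    }
    return @lines;
}

# Small demo with two of the sites from the question.
my %demo = (
    'new.example.com' => { 28 => 2 },
    'example.com'     => { 28 => 1, 29 => 4 },
);
print "$_\n" for render_table(\%demo);
```

Because the first column's width is computed from the data, the week columns stay aligned however long the domain names get.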

How can I open a Unicode file with Perl?

I'm using osql to run several sql scripts against a database and then I need to look at the results file to check if any errors occurred. The problem is that Perl doesn't seem to like the fact that the results files are Unicode.
I wrote a little script to test it, and the output comes out all garbled:
$file = shift;
open OUTPUT, $file or die "Can't open $file: $!\n";
while (<OUTPUT>) {
print $_;
if (/Invalid|invalid|Cannot|cannot/) {
push(@invalids, $file);
print "invalid file - $inputfile - schedule for retry\n";
last;
}
}
Any ideas? I've tried decoding using decode_utf8 but it makes no difference. I've also tried to set the encoding when opening the file.
I think the problem might be that osql puts the result file in UTF-16 format, but I'm not sure. When I open the file in textpad it just tells me 'Unicode'.
Edit: Using perl v5.8.8
Edit: Hex dump:
file name: Admin_CI.User.sql.results
mime type:
0000-0010: ff fe 31 00-3e 00 20 00-32 00 3e 00-20 00 4d 00 ..1.>... 2.>...M.
0000-0020: 73 00 67 00-20 00 31 00-35 00 30 00-30 00 37 00 s.g...1. 5.0.0.7.
0000-0030: 2c 00 20 00-4c 00 65 00-76 00 65 00-6c 00 20 00 ,...L.e. v.e.l...
0000-0032: 31 00 1.

The file is presumably in UCS-2LE (or UTF-16) format.
C:\Temp> notepad test.txt
C:\Temp> xxd test.txt
0000000: fffe 5400 6800 6900 7300 2000 6900 7300 ..T.h.i.s. .i.s.
0000010: 2000 6100 2000 6600 6900 6c00 6500 2e00 .a. .f.i.l.e...
When opening such a file for reading, you need to specify the encoding:
#!/usr/bin/perl
use strict; use warnings;
my ($infile) = @ARGV;
open my $in, '<:encoding(UCS-2le)', $infile
    or die "Cannot open '$infile': $!";
Note that the fffe at the beginning is the BOM.
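With an explicit LE or BE encoding name, Encode treats the BOM as an ordinary character, so it stays at the start of the first line. A minimal sketch of an alternative, assuming the file always starts with a BOM as in the hex dumps above: the plain UTF-16 encoding uses the BOM to pick the byte order and drops it. The helper name read_utf16_lines is hypothetical, and the :raw layer is there so that CRLF handling does not interfere on Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: read a BOM-prefixed UTF-16 file into a list of lines.
sub read_utf16_lines {
    my ($path) = @_;
    # UTF-16 (no LE/BE suffix) requires a BOM, uses it to choose the byte
    # order, and does not pass it through to the decoded text.
    open(my $in, '<:raw:encoding(UTF-16)', $path)
        or die "Cannot open '$path': $!";
    my @lines = <$in>;
    close($in);
    s/\r?\n\z// for @lines;    # strip LF or CRLF line endings
    return @lines;
}

if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print "$_\n" for read_utf16_lines($ARGV[0]);
}
```

If a file has no BOM, this open mode will fail to decode it, and an explicit name such as UTF-16LE is the safer choice.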

The answer is in the documentation for open, which also points you to perluniintro. :)
open my $fh, '<:encoding(UTF-16LE)', $file or die ...;
You can get a list of the names of the encodings that your perl supports:
% perl -MEncode -le "print for Encode->encodings(':all')"
After that, it's up to you to find out what the file encoding is. This is the same way you'd open any file with an encoding different than the default, whether it's one defined by Unicode or not.
We have a chapter in Effective Perl Programming that goes through the details.

Try opening the file with an IO layer specified, e.g. :
open OUTPUT, "<:encoding(UTF-8)", $file or die "Can't open $file: $!\n";
See perldoc open for more on this.

#
# -----------------------------------------------------------------------------
# Reads a file and returns a string; if the second param is 'utf8', returns a utf8 string
# usage:
# ( $ret , $msg , $str_file )
# = $objFileHandler->doReadFileReturnString ( $file , 'utf8' ) ;
# or
# ( $ret , $msg , $str_file )
# = $objFileHandler->doReadFileReturnString ( $file ) ;
# -----------------------------------------------------------------------------
sub doReadFileReturnString {
my $self = shift;
my $file = shift;
my $mode = shift ;
my $msg = {} ;
my $ret = 1 ;
my $s = q{} ;
$msg = " the file : $file does not exist !!!" ;
cluck ( $msg ) unless -e $file ;
$msg = " the file : $file is not actually a file !!!" ;
cluck ( $msg ) unless -f $file ;
$msg = " the file : $file is not readable !!!" ;
cluck ( $msg ) unless -r $file ;
$msg .= "can not read the file $file !!!";
return ( $ret , "$msg ::: $! !!!" , undef )
unless ((-e $file) && (-f $file) && (-r $file));
$msg = '' ;
$s = eval {
my $string = (); #slurp the file
{
local $/ = undef;
if ( defined ( $mode ) && $mode eq 'utf8' ) {
open FILE, "<:utf8", $file
or cluck("failed to open \$file $file : $!");
$string = <FILE> ;
die "did not find utf8 string in file: $file"
unless utf8::valid ( $string ) ;
}
else {
open FILE, "$file "
or cluck "failed to open \$file $file : $!" ;
$string = <FILE> ;
}
close FILE;
}
$string ;
};
if ( $@ ) {
$msg = $! . " " . $@ ;
$ret = 1 ;
$s = undef ;
} else {
$ret = 0 ; $msg = "ok for read file: $file" ;
}
return ( $ret , $msg , $s ) ;
}
#eof sub doReadFileReturnString
