Merge txt files in Perl, but modify them before, leaving original files untouched - perl

I've already posted a question and fixed the problem in my code, but now my "specification has changed", so to speak, and I need to change a few things.
Here's code that takes all .txt files from the current directory, cuts off the last line of the first file, the first and last lines of every intermediate file, and the first line of the last file, and writes everything to a new file (in other words: it merges all the files, deleting headers and footers so that the new file has only one header and one footer).
#!/usr/bin/perl
use warnings;
use Cwd;
use Tie::File;
use Tie::Array;
my $cwd = getcwd();
my $buff = '';
# Get all files in cwd.
my @files = grep ( -f, <*.txt> );
# Cut off header and footer of $files[1] to $files[$#files-1],
# but only footer of $files[0] and header of $files[$#files]
for (my $i = 0; $i <= $#files; $i++) {
    print 'Opening ' . $files[$i] . "\n";
    tie (@lines, Tie::File, $files[$i]) or die "can't update $file: $!";
    splice @lines, 0, 1 unless $i == 0;
    splice @lines, -1, 1 unless $i == $#files;
    untie @lines;
    open (file, "<", $files[$i]) or die "can't update $file: $!";
    while (my $line = <file>) {
        $buff .= $line;
    }
    close file;
}
# Write the buffer to a new file.
my $allfilename = $cwd.'/Trace.txt';
print 'Writing all files into new file: ' . $allfilename . "\n";
open $outputfile, ">".$allfilename or die "can't write to new file $outputfile: $!";
# Write the buffer into the output file.
print $outputfile $buff;
close $outputfile;
My problem: I don't want to change the original files, but my code does exactly that, and I'm having trouble coming up with a solution. The simplest way (simple meaning not having to change too much code) would be to copy all the files to a tmp directory, mess around with the copies, and leave the originals untouched. Problem: a simple call to dircopy doesn't do it for me, since you have to pass it a literal tmp directory path, which ties the code to either Windows or UNIX (but I need portability).
The next approach would be to make use of the File::Temp module, but I'm really having trouble with the docs on this one.
Does anybody have a good idea on this one?
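For illustration, a minimal (untested) sketch of that copy-to-tempdir idea using File::Temp and File::Copy might look like this; tempdir() picks a platform-appropriate location, so no OS-specific path is hard-coded:
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Copy qw(copy);
# CLEANUP => 1 removes the directory (and the copies) when the program exits.
my $tmpdir = tempdir( CLEANUP => 1 );
for my $file ( grep -f, glob '*.txt' ) {
    copy( $file, "$tmpdir/$file" ) or die "Can't copy '$file' to '$tmpdir': $!";
}
# ...then run the existing merge code against "$tmpdir/*.txt"...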

I suspected that you didn't really want your original files modified when I answered your previous question.
I don't understand why you've gone back to accumulating all the text in a buffer before printing it, or why you've removed use strict, which is essential to any well-written Perl code.
Here's my previous solution modified to leave the input data untouched.
use strict;
use warnings;
use Tie::File;
my @files = grep -f, glob '*.txt';
my $all_filename = 'Trace.txt';
open my $out_fh, '>', $all_filename or die qq{Unable to open "$all_filename" for output: $!};
for my $i ( 0 .. $#files ) {
    my $file = $files[$i];
    next if $file eq $all_filename;
    print "Opening $file\n";
    tie my @lines, 'Tie::File', $file or die qq{Can't open "$file": $!};
    my ($start, $end) = (0, $#lines);
    ++$start unless $i == 0;
    --$end unless $i == $#files;
    print $out_fh "$_\n" for @lines[$start..$end];
}
close $out_fh;

#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $outfile = 'Trace.txt';
# Get all files in cwd.
my @files = grep { -f && $_ ne $outfile } <*.txt>;
open my $outfh, '>', $outfile;
for my $file (@files) {
    # localizing @ARGV to this one file lets <> slurp just that file
    my @lines = do { local @ARGV = $file; <> };
    shift @lines unless $file eq $files[0];
    pop @lines unless $file eq $files[-1];
    print $outfh @lines;
}

Just don't use Tie::File. Or is there a reason you need it, for example that all your files together don't fit in memory?
A version very close to your current implementation would be something like the following (untested) code. It simply skips the part where you update the files, only to reopen and read them right afterwards. (Note that this is neither particularly efficient nor especially elegant; it just sticks as closely as possible to your implementation.)
#!/usr/bin/perl
use warnings;
use Cwd;
# use Tie::File;
# use Tie::Array;
my $cwd = getcwd();
my $buff = '';
# Get all files in cwd.
my @files = grep ( -f, <*.txt> );
# Cut off header and footer of $files[1] to $files[$#files-1],
# but only footer of $files[0] and header of $files[$#files]
for (my $i = 0; $i <= $#files; $i++) {
    print 'Opening ' . $files[$i] . "\n";
    open (my $fh, "<", $files[$i]) or die "can't open $files[$i] for reading: $!";
    my @lines = <$fh>;
    splice @lines, 0, 1 unless $i == 0;
    splice @lines, -1, 1 unless $i == $#files;
    foreach my $line (@lines) {
        $buff .= $line;
    }
}
# Write the buffer to a new file.
my $allfilename = $cwd.'/Trace.txt';
print 'Writing all files into new file: ' . $allfilename . "\n";
open my $outputfile, '>', $allfilename or die "can't write to new file $allfilename: $!";
# Write the buffer into the output file.
print $outputfile $buff;
close $outputfile;

Based on Miller's answer, but better suited to large files.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $outfile = 'Trace.txt';
# Get all files in cwd.
my @files = grep { -f && $_ ne $outfile } <*.txt>;
open my $outfh, '>', $outfile;
my $counter = 0;
for my $file (@files) {
    open my $fh, '<', $file;
    my ($line, $prev) = ('', '');
    my $l = 0;
    while ($line = <$fh>) {
        print $outfh $prev unless $l++ == 1 and $counter > 0;
        $prev = $line;
    }
    $counter++;
    print $outfh $prev if $counter == @files and $l > 0;
    close $fh;
}

Related

perl to merge csv files removing the headings

I have several monthly reports in csv format in a folder. The csv files all have 8 common columns (with headings). Using Perl, I would like to merge these files together line by line.
Say
file 1:
1,2,3,4,5,6,7,8,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
file 2:
1,2,3,4,5,6,7,8,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
I would like the output to look something like this (join the rows and remove the headings):
output:
1,2,3,4,5,6,7,8,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
I have managed to save the names of the files in an array, but for some reason I could not join them.
Can you please help me figure out what is wrong with my code? I am quite new to Perl.
#! C:Strawberry/perl/bin;
use feature ':5.12';
use strict;
use warnings;
my $data_directory = 'R:/testing_data/';
opendir( DIR, $data_directory ) or die "Could not open $data_directory $!\n";
my @files = grep {/_monthlyreport\.csv$/} readdir(DIR); # to get the monthly report csv files
foreach my $file (@files) {
    open( HANR, "<", '$data_directory' . my $files ) or die "cannot open $files: $!"; # read handle
    open( HANW, ">>", "G:/outputfile_script.csv" ) or die "error $! \n"; # write handle for creating new sorted files
    my @lines = ();
    @lines = <HANR>;
    foreach my $line (@lines) {
        chomp($line);
        my $count++;
        next unless $count; # skip header i.e. the first line containing stock details
        print HANW join $line, "\n";
    }
    my $count = -1;
    close(HANW);
    close(HANR);
}
closedir(DIR);
exit 0;
Your open statement to your input filehandle is malformed, and my $count++; is also broken.
I'd also recommend modernizing your code by using lexical file handles. The following is a cleaned-up version of your code:
use feature ':5.12';
use strict;
use warnings;
use autodie;
my $data_directory = 'R:/testing_data/';
opendir my $dh, "$data_directory";
open my $outfh, ">>", "G:/outputfile_script.csv";
my $seenheader = 0;
while (my $file = readdir $dh) {
    next unless $file =~ /_monthlyreport\.csv$/;
    open my $infh, '<', "$data_directory/$file";
    while (<$infh>) {
        print $outfh $_ if $. > 1 || ! $seenheader++;
    }
}
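($. is Perl's input line number for the handle just read, so $. > 1 passes every line except a file's first, while !$seenheader++ is true only the very first time, letting a single header through.)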
This line is wrong.
open(HANR ,"<",'$data_directory'.my $files) or die "cannot open $files: $!";
Should be
open(HANR ,"<","$data_directory".$files) or die "cannot open $files: $!";
Add a counter and skip printing while it equals 0:
#! C:Strawberry/perl/bin;
use feature ':5.12';
use strict;
use warnings;
my $data_directory = 'R:/testing_data/';
opendir(DIR, $data_directory) or die "Could not open $data_directory $!\n";
my @files = grep {/_monthlyreport\.csv$/} readdir(DIR); # to get the monthly report csv files
foreach my $file (@files) {
    open(HANR, "<", $data_directory . $file) or die "cannot open $file: $!"; # read handle
    open(HANW, ">>", "G:/outputfile_script.csv") or die "error $! \n"; # write handle for the merged output
    my @lines = ();
    @lines = <HANR>;
    my $i = 0;
    foreach my $line (@lines) {
        next if ($i++ == 0); # skip header, i.e. the first line containing stock details
        chomp($line);
        print HANW $line, "\n";
    }
    close(HANW);
    close(HANR);
}
closedir(DIR);
exit 0;

perl + read multiple csv files + manipulate files + provide output_files

Apologies if this is a bit long-winded, but I would really appreciate an answer here as I am having difficulty getting this to work.
Building on from this question here, I have this script that works on a csv file (orig.csv) and produces the csv file that I want (format.csv). What I want is to make this more generic so it accepts any number of .csv files and produces an output csv for each input file. Can anyone help?
#!/usr/bin/perl
use strict;
use warnings;
open my $orig_fh, '<', 'orig.csv' or die $!;
open my $format_fh, '>', 'format.csv' or die $!;
print $format_fh scalar <$orig_fh>; # Copy header line
my %data;
my @labels;
while (<$orig_fh>) {
    chomp;
    my @fields = split /,/, $_, -1;
    my ($label, $max_val) = @fields[1,12];
    if ( exists $data{$label} ) {
        my $prev_max_val = $data{$label}[12] || 0;
        $data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
    }
    else {
        $data{$label} = \@fields;
        push @labels, $label;
    }
}
for my $label (@labels) {
    print $format_fh join(',', @{ $data{$label} }), "\n";
}
I was hoping to use this script from here, but am having great difficulty putting the two together:
#!/usr/bin/perl
use strict;
use warnings;
#If you want to open a new output file for every input file
#Do it in your loop, not here.
#my $outfile = "KAC.pdb";
#open( my $fh, '>>', $outfile );
opendir( DIR, "/data/tmp" ) or die "$!";
my #files = readdir(DIR);
closedir DIR;
foreach my $file (#files) {
open( FH, "/data/tmp/$file" ) or die "$!";
my $outfile = "output_$file"; #Add a prefix (anything, doesn't have to say 'output')
open(my $fh, '>', $outfile);
while (<FH>) {
my ($line) = $_;
chomp($line);
if ( $line =~ m/KAC 50/ ) {
print $fh $_;
}
}
close($fh);
}
The script reads all the files in the directory, finds the lines containing the string 'KAC 50', and appends each such line to an output_$file for that input file, so there will be one output_$file for every input file that is read.
Issues with this script that I have noted and was looking to fix:
- it reads the '.' and '..' entries in the directory and produces an 'output_.' and 'output_..' file
- it will also do the same with this script file itself.
I was also trying to make it dynamic by getting this script to work in any directory it is run in by adding this code:
use Cwd qw();
my $path = Cwd::cwd();
print "$path\n";
and
opendir( DIR, $path ) or die "$!"; # open the current directory
open( FH, "$path/$file" ) or die "$!"; #open the file
EDIT: I have tried combining the versions but am getting errors. Advice greatly appreciated.
UserName@wabcl13 ~/Perl
$ perl formatfile_QforStackOverflow.pl
Parentheses missing around "my" list at formatfile_QforStackOverflow.pl line 13.
source dir -> /home/UserName/Perl
Can't use string ("/home/UserName/Perl/format_or"...) as a symbol ref while "strict refs" in use at formatfile_QforStackOverflow.pl line 28.
combined code::
use strict;
use warnings;
use autodie; # this is used for the multiple files part...
#START::Getting current working directory
use Cwd qw();
my $source_dir = Cwd::cwd();
#END::Getting current working directory
print "source dir -> $source_dir\n";
my $output_prefix = 'format_';
opendir my $dh, $source_dir; #Changing this to work on current directory; changing back
for my $file (readdir($dh)) {
    next if $file !~ /\.csv$/;
    next if $file =~ /^\Q$output_prefix\E/;
    my $orig_file = "$source_dir/$file";
    my $format_file = "$source_dir/$output_prefix$file";
    # .... old processing code here ...
    ## Start:: This part works on one file, edited for this script ##
    #open my $orig_fh, '<', 'orig.csv' or die $!; # lines 14 and 15 above already do this!!
    #open my $format_fh, '>', 'format.csv' or die $!;
    #print $format_fh scalar <$orig_fh>; # Copy header line # orig needs changing
    print $format_file scalar <$orig_file>; # Copy header line
    my %data;
    my @labels;
    #while (<$orig_fh>) { # orig needs changing
    while (<$orig_file>) {
        chomp;
        my @fields = split /,/, $_, -1;
        my ($label, $max_val) = @fields[1,12];
        if ( exists $data{$label} ) {
            my $prev_max_val = $data{$label}[12] || 0;
            $data{$label} = \@fields if $max_val and $max_val > $prev_max_val;
        }
        else {
            $data{$label} = \@fields;
            push @labels, $label;
        }
    }
    for my $label (@labels) {
        #print $format_fh join(',', @{ $data{$label} }), "\n"; # orig needs changing
        print $format_file join(',', @{ $data{$label} }), "\n";
    }
    ## END:: This part works on one file, edited for this script ##
}
How do you plan on inputting the list of files to process, and where should the output go? Maybe just have a fixed directory whose csv files you process, prefixing each result file.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $source_dir = '/some/dir/with/csv/files';
my $output_prefix = 'format_';
opendir my $dh, $source_dir;
for my $file (readdir($dh)) {
    next if $file !~ /\.csv$/;
    next if $file =~ /^\Q$output_prefix\E/;
    my $orig_file = "$source_dir/$file";
    my $format_file = "$source_dir/$output_prefix$file";
    # ... old processing code here ...
}
Alternatively, you could just have an output directory instead of prefixing the files. Either way, this should get you on your way.
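For the output-directory variant, a minimal (untested) sketch might look like the following, where $output_dir and the 'formatted' subdirectory name are just placeholders:
use strict;
use warnings;
use autodie;
use File::Path qw(make_path);
my $source_dir = '/some/dir/with/csv/files';
my $output_dir = "$source_dir/formatted"; # hypothetical output location
make_path($output_dir); # create it if it doesn't exist
opendir my $dh, $source_dir;
for my $file (readdir($dh)) {
    next if $file !~ /\.csv$/;
    my $orig_file = "$source_dir/$file";
    my $format_file = "$output_dir/$file"; # same name, different directory
    # ... old processing code here ...
}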

how to count the number of specific characters through each line from file?

I'm trying to count the number of 'N's in a FASTA file which is:
>Header
AGGTTGGNNNTNNGNNTNGN
>Header2
AGNNNNNNNGNNGNNGNNGN
so in the end I want the count of the number of 'N's; each header is a read, and I want to make a histogram, so at the end I would output something like this:
# of N's # of Reads
0 300
1 240
etc...
So there are 300 sequences (or reads) that contain no 'N's.
use strict;
use warnings;
my $file = shift;
my $output_file = shift;
my $line;
my $sequence;
my $length;
my $char_N_count = 0;
my @array;
my $count = 0;
if (!defined ($output_file)) {
    die "USAGE: Input FASTA file\n";
}
open (IFH, "$file") or die "Cannot open input file$!\n";
open (OFH, ">$output_file") or die "Cannot open output file $!\n";
while($line = <IFH>) {
    chomp $line;
    next if $line =~ /^>/;
    $sequence = $line;
    @array = split ('', $sequence);
    foreach my $element (@array) {
        if ($element eq 'N') {
            $char_N_count++;
        }
    }
    print "$char_N_count\n";
}
Try this. I changed a few things, like using scalar file handles. There are many ways to do this in Perl, so some people will have other ideas. In this case I used an array which may have gaps in it; another option is to store the results in a hash keyed by the count (see the sketch after the code below).
Edit: Just realised I'm not using $output_file, because I have no idea what you want to do with it :) Just change the 'print' at the end to 'print $out_fh' if your intent is to write to it.
use strict;
use warnings;
my $file = shift;
my $output_file = shift;
if (!defined ($output_file)) {
die "USAGE: $0 <input_file> <output_file>\n";
}
open (my $in_fh, '<', $file) or die "Cannot open input file '$file': $!\n";
open (my $out_fh, '>', $output_file) or die "Cannot open output file '$output_file': $!\n";
my @results = ();
while (my $line = <$in_fh>) {
    next if $line =~ /^>/;
    my $num_n = ($line =~ tr/N//);
    $results[$num_n]++;
}
print "# of N's\t# of Reads\n";
for (my $i = 0; $i < scalar(@results); $i++) {
    unless (defined($results[$i])) {
        $results[$i] = 0;
        # another option is to 'next' if you don't want to show the zero totals
    }
    print "$i\t\t$results[$i]\n";
}
close($in_fh);
close($out_fh);
exit;
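For comparison, here is a minimal sketch of the hash-keyed alternative mentioned above; a hash has no gaps to fill in, at the cost of sorting the keys for output:
use strict;
use warnings;
my %results;
while (my $line = <>) { # reads the FASTA file named on the command line
    next if $line =~ /^>/; # skip header lines
    $results{ $line =~ tr/N// }++; # tr/// in scalar context counts the N's
}
print "# of N's\t# of Reads\n";
print "$_\t\t$results{$_}\n" for sort { $a <=> $b } keys %results;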

Adding header of the first file to all the other split files in Perl

I need to add the header of the first (main) file to all the split files. At the moment I get the header only in the first split file, but I need it in all of them. Here I am splitting a DAT file. Below is what I have done so far:
#!usr/bin/perl -w
my $chunksize = 25000000; # 25MB
my $filenumber = 0;
my $infile = "Test.dat";
my $outsize = 0;
my $eof = 0;
my $line = $_;
open INFILE, $infile;
open OUTFILE, ">outfile_".$filenumber.".dat";
while (<INFILE>) {
    chomp;
    if ($outsize > $chunksize) {
        close OUTFILE;
        $outsize = 0;
        $filenumber++;
        open (OUTFILE, ">outfile_".$filenumber.".dat")
            or die "Can't open outfile_".$filenumber.".dat";
    }
    print OUTFILE "$_\n";
    $outsize += length;
}
close INFILE;
- You should always use warnings (in preference to the command-line -w) and use strict. That way many simple errors that you might otherwise have overlooked will be flagged.
- Use the three-parameter form of open with lexical filehandles.
- Check the result of all open calls and flag errors containing the value of $! in a die string.
- Define constant values with the use constant pragma rather than as Perl variables.
- The number of bytes printed to a filehandle can be found with the tell function, so there is no need to keep your own count.
- To solve your specific problem, you should read and remember the first line of your input file, and print it to each new output file as it is opened.
- It is easier to keep track of the output files if you open them when you have new data to write and no open file, and close them when they are full or when you have reached the end of the input data.
This program demonstrates the ideas and does what is required
use strict;
use warnings;
use constant INFILE => 'Test.dat';
use constant CHUNKSIZE => 25_000_000; # 25MB
open my $infh, '<', INFILE or die $!;
my $header = <$infh>;
my $outfh;
my $filenumber = 0;
while (my $line = <$infh>) {
    unless ($outfh) {
        my $outfile = "outfile_$filenumber.dat";
        open $outfh, '>', $outfile or die "Can't open '$outfile': $!";
        print { $outfh } $header;
        $filenumber++;
    }
    print { $outfh } $line;
    if (tell $outfh > CHUNKSIZE or eof $infh) {
        close $outfh or die $!;
        undef $outfh;
    }
}
You need to store the header from the input file and print it every time a new file is opened:
use strict;
use warnings;
use autodie;
# initializations ...
open my $in, '<', $infile;
open my $out, '>', "outfile_${file_number}.dat";
my $header = <$in>; # Save the header...
chomp $header;      # ... not strictly necessary
while ( <$in> ) {
    chomp; # Not strictly necessary
    if ( $outsize > $chunksize ) {
        close $out;
        $outsize = 0;
        $file_number++;
        open $out, '>', "outfile_${file_number}.dat";
        print $out $header, "\n"; # Prints header at beginning of file
                                  # (newline needed if $header was chomped)
    }
    print $out $_, "\n"; # Newline needed if $_ chomped
    $outsize += length;
}

merging two files using perl keeping the copy of original file in other file

I have two files, A.ini and B.ini, and I want to merge both files into A.ini.
Examples of the files:
A.ini:
a=123
b=xyx
c=434
B.ini contains:
a=abc
m=shank
n=paul
My output in A.ini should look like
a=123abc
b=xyx
c=434
m=shank
n=paul
I want this merging to be done in Perl, and I want to keep a copy of the old A.ini file somewhere else, so I can still use the old copy.
A command line variant:
perl -lne '
    ($a, $b) = split /=/;
    $v{$a} = $v{$a} ? $v{$a} . $b : $_;
    END {
        print $v{$_} for sort keys %v
    }' A.ini B.ini >NEW.ini
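Note that the END block prints the keys in sorted order, so the original key order is not preserved, and A.ini itself is left untouched; to end up with a merged A.ini plus a backup, rename A.ini to A.ini.bak and NEW.ini to A.ini after running it.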
How about:
#!/usr/bin/perl
use strict;
use warnings;
my %out;
my $file = 'path/to/A.ini';
open my $fh, '<', $file or die "unable to open '$file' for reading: $!";
while (<$fh>) {
    chomp;
    my ($key, $val) = split /=/;
    $out{$key} = $val;
}
close $fh;
$file = 'path/to/B.ini';
open $fh, '<', $file or die "unable to open '$file' for reading: $!";
while (<$fh>) {
    chomp;
    my ($key, $val) = split /=/;
    if (exists $out{$key}) {
        $out{$key} .= $val;
    } else {
        $out{$key} = $val;
    }
}
close $fh;
$file = 'path/to/A.ini';
open $fh, '>', $file or die "unable to open '$file' for writing: $!";
foreach (keys %out) {
    print $fh $_, '=', $out{$_}, "\n";
}
close $fh;
The two files to be merged can be read in a single pass and don't need to be treated as separate source files. That allows the use of <> to read all files passed as parameters on the command line.
Keeping a backup copy of A.ini is simply a matter of renaming it before writing the merged data to a new file of the same name.
This program appears to do what you need.
use strict;
use warnings;
my $file_a = $ARGV[0];
my (@keys, %values);
while (<>) {
    if (/\A\s*(.+?)\s*=\s*(.+?)\s*\z/) {
        push @keys, $1 unless exists $values{$1};
        $values{$1} .= $2;
    }
}
rename $file_a, "$file_a.bak" or die qq(Unable to rename "$file_a": $!);
open my $fh, '>', $file_a or die qq(Unable to open "$file_a" for output: $!);
printf $fh "%s=%s\n", $_, $values{$_} for @keys;
Output (in A.ini):
a=123abc
b=xyx
c=434
m=shank
n=paul