Perl to merge CSV files, removing the headings

I have several monthly reports in CSV format in a folder. The CSV files all have 8 common columns (with headings). Using Perl, I would like to merge these files together line by line. Say:
file 1:
1,2,3,4,5,6,7,8,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
file 2:
1,2,3,4,5,6,7,8,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
I would like the output to look something like this (join the rows and remove the repeated headings):
output:
1,2,3,4,5,6,7,8,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
a1,b1,c1,d1,e1,f1,g1,h1,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
a2,b2,c2,d2,e2,f2,g2,h2,
I have managed to save the names of the files in an array, but for some reason I could not join them.
Can you please help me figure out what is wrong with my code? I am quite new to Perl.
#! C:Strawberry/perl/bin;
use feature ':5.12';
use strict;
use warnings;
my $data_directory = 'R:/testing_data/';
opendir( DIR, $data_directory ) or die "Could not open $data_directory $!\n";
my @files = grep {/_monthlyreport\.csv$/} readdir(DIR);    # to get on the monthly reports csv files
foreach my $file (@files) {
    open( HANR, "<", '$data_directory' . my $files ) or die "cannot open $files: $!";    # read handler
    open( HANW, ">>", "G:/outputfile_script.csv" ) or die "error $! \n";    # write handler for creating new sorted files
    my @lines = ();
    @lines = <HANR>;
    foreach my $line (@lines) {
        chomp($line);
        my $count++;
        next unless $count;    # skip header i.e the first line containing stock details
        print HANW join $line, "\n";
    }
    my $count = -1;
    close(HANW);
    close(HANR);
}
closedir(DIR);
exit 0;

Your open statement for your input filehandle is malformed, and my $count++; is also broken.
I'd also recommend modernizing your code by using lexical filehandles. The following is a cleaned-up version of your code:
use feature ':5.12';
use strict;
use warnings;
use autodie;
my $data_directory = 'R:/testing_data/';
opendir my $dh, "$data_directory";
open my $outfh, ">>", "G:/outputfile_script.csv";
my $seenheader = 0;
while (my $file = readdir $dh) {
    next unless $file =~ /_monthlyreport\.csv$/;
    open my $infh, '<', "$data_directory/$file";
    while (<$infh>) {
        # $. is the line number within the current file: print every line
        # after the header, but let the very first header through exactly once.
        print $outfh $_ if $. > 1 || ! $seenheader++;
    }
}

This line is wrong:
open(HANR ,"<",'$data_directory'.my $files) or die "cannot open $files: $!";
Single-quoted strings don't interpolate, so '$data_directory' is the literal text rather than the directory name, and my $files declares a brand-new undefined variable instead of using your loop variable $file. It should be:
open(HANR, "<", $data_directory . $file) or die "cannot open $file: $!";

Add a counter and skip printing while the counter is 0 (the header line):
#! C:Strawberry/perl/bin;
use feature ':5.12';
use strict;
use warnings;
my $data_directory = 'R:/testing_data/';
opendir(DIR, $data_directory) or die "Could not open $data_directory $!\n";
my @files = grep {/_monthlyreport\.csv$/} readdir(DIR);    # only the monthly report csv files
foreach my $file (@files) {
    open(HANR, "<", $data_directory . $file) or die "cannot open $file: $!";    # read handle
    open(HANW, ">>", "G:/outputfile_script.csv") or die "error $! \n";          # write handle
    my @lines = <HANR>;
    my $i = 0;
    foreach my $line (@lines) {
        next if $i++ == 0;    # skip header, i.e. the first line containing stock details
        chomp($line);
        print HANW $line, "\n";
    }
    close(HANW);
    close(HANR);
}
closedir(DIR);
exit 0;

Related

Write old fasta header and new to file

I want to extract the old fasta names, which look something like this:
>Bartonella bibbi
AUUCCGGUUGAUCCUGCCGGAGGCCACUGCUAUCGGGGUCCG
The new headers should look like this:
>Seq1
AUUCCGGUUGAUCCUGCCGGAGGCCACUGCUAUCGGGGUCCG
and so on...
The Bartonella bibbi should be saved together with the new name Seq1 in a new file, and so on. So I've started a bit by looking for lines with >, and then I split to get an array holding the old name. I don't know how to continue, because I want two things here: first, to put the new name in, but also to extract the old name together with the new one into a file, and ALSO to get an output file with my sequences and my new names. Please, any input from you will help!
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
while (my $line = <$IN>) {
    if ($line =~ /^>/) {
        my @header = split (/\>/, $line);
        my $oldfasta = "$header[1]";
    }
}
So after some edits, this is the current script:
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $seqid = 1;
my %id;
while (my $line = <$IN>) {
    if ($line =~ /^>/) {
        $id{"Seq$seqid "} = $line;
        print ">Seq$seqid\n";
        $seqid++;
    } else {
        print $line;
    }
}
my $outfile = 'output';
open my $OUT, '>', $outfile or die "Could not open $outfile: $!, $?"; # overwrites the file $outfile
print $OUT %id;
This gives me a file that looks like this:
Seq29 >Sulfophobococcus_zilligii
Seq20 >Pyrococcus_shinkaii
and so on.
They are not in order. How do I sort them and get rid of the > in the species name?
You’re simply not printing anything. Once you add a print statement, it should work.
In addition, it’s unclear what you’re using split for. Just increase a counter for the sequence:
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $seqid = 1;
while (my $line = <$IN>) {
    if ($line =~ /^>/) {
        print ">Seq$seqid\n";
        $seqid++;
    } else {
        print $line;
    }
}
Simply write the new entries as you create them.
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $outfile = 'output';
open my $OUT, '>', $outfile or die "Could not open $outfile: $!, $?"; # overwrites the file $outfile
my $seqid = 1;
while (my $line = <$IN>) {
    if ($line =~ /^>(.+)/) {
        print $OUT "Seq$seqid\t$1\n";    # record the mapping: new name -> old header
        print ">Seq$seqid\n";
        $seqid++;
    } else {
        print $line;
    }
}
I tried to fix the indentation but left the gratuitous variable for the $OUT file name.
If you want to keep the mapping in memory for other reasons (maybe to develop this into a much more complex script) using an array instead of a hash would seem like a natural way to keep the entries sorted; the new label is trivially derivable from the array index.
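For instance, here is a rough sketch of that array-based idea (untested, and the mapping file name is just a placeholder): the array index keeps the entries in input order, so no sorting is needed, and capturing the header without the > handles the species-name cleanup from your follow-up.
#!/usr/bin/perl
use strict;
use warnings;

my $infile = $ARGV[0];
open my $IN,  '<', $infile        or die "Could not open $infile: $!";
open my $OUT, '>', 'mapping.txt'  or die "Could not open mapping.txt: $!";

my @old_names;    # $old_names[0] holds the original header for Seq1, and so on
while (my $line = <$IN>) {
    if ($line =~ /^>(.+)/) {    # capture the old header without the leading '>'
        push @old_names, $1;
        print '>Seq', scalar @old_names, "\n";
    } else {
        print $line;
    }
}

# The array is already in input order, so the mapping comes out sorted
# and the '>' was never captured in the first place.
for my $i (0 .. $#old_names) {
    print $OUT 'Seq', $i + 1, "\t$old_names[$i]\n";
}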

Perl Script: sorting through log files.

Trying to write a script which opens a directory and reads a bunch of log files line by line, searching for information such as, for example:
"Attendance = 0". Previously I have used grep "Attendance =" * to search for this information, but now I am trying to write a script to do the search.
Need your help to finish this task.
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/path/';
opendir (DIR, $dir) or die $!;
while (my $file = readdir(DIR))
{
    print "$file\n";
}
closedir(DIR);
exit 0;
What's your Perl experience?
I'm assuming each file is a text file. I'll give you a hint. Try to figure out where to put this code.
# Now to open and read a text file.
my $fn='file.log';
# $! is a variable which holds a possible error msg.
open(my $INFILE, '<', $fn) or die "ERROR: could not open $fn. $!";
my @filearr=<$INFILE>; # Read the whole file into an array.
close($INFILE);
# Now look in @filearr, which has one entry per line of the original file.
exit; # Normal exit
I prefer to use File::Find::Rule for things like this. It preserves path information, and it's easy to use. Here's an example that does what you want.
use strict;
use warnings;
use File::Find::Rule;
my $dir = '/path/';
my $type = '*';
my @files = File::Find::Rule->file()
                            ->name($type)
                            ->in($dir);
for my $file (@files){
    print "$file\n\n";
    open my $fh, '<', $file or die "can't open $file: $!";
    while (my $line = <$fh>){
        if ($line =~ /Attendance =/){
            print $line;
        }
    }
}

Merge txt files in Perl, but modify them first, leaving the original files untouched

I've already posted a question and fixed the problem in my code, but now my "specification has changed", so to speak, and I need to change some things about it.
Here's code that takes all .txt files from the current directory, cuts off the last line of the first file, the first and last line of every following file, and the first line of the last file, and writes everything to a new file (in other words: it merges all files, deleting headers and footers so that the new file has only one header and one footer).
#!/usr/bin/perl
use warnings;
use Cwd;
use Tie::File;
use Tie::Array;
my $cwd = getcwd();
my $buff = '';
# Get all files in cwd.
my @files = grep ( -f , <*.txt> );
# Cut off header and footer of $files[1] to $files[$#files-1],
# but only footer of $files[0] and header of $files[$#files]
for (my $i = 0; $i <= $#files; $i++) {
    print 'Opening ' . $files[$i] . "\n";
    tie (@lines, Tie::File, $files[$i]) or die "can't update $file: $!";
    splice @lines, 0, 1 unless $i == 0;
    splice @lines, -1, 1 unless $i == $#files;
    untie @lines;
    open (file, "<", $files[$i]) or die "can't update $file: $!";
    while (my $line = <file>) {
        $buff .= $line;
    }
    close file;
}
# Write the buffer to a new file.
my $allfilename = $cwd.'/Trace.txt';
print 'Writing all files into new file: ' . $allfilename . "\n";
open $outputfile, ">".$allfilename or die "can't write to new file $outputfile: $!";
# Write the buffer into the output file.
print $outputfile $buff;
close $outputfile;
My problem: I don't want to change the original files, but my code does exactly that, and I'm having trouble coming up with a solution. The simplest way (simple meaning not having to change too much code) would be to copy all the files to a tmp directory, mess around with the copies, and leave the original files untouched. Problem: a simple use of dircopy doesn't do it for me, since you have to give a new tmp dir to the dircopy function, making the code only usable on Windows or on UNIX systems (but I need portability).
The next approach would be to make use of the File::Temp module but I'm really having trouble with the docs on this one.
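From skimming the docs, I think the temp-directory part would look roughly like this (the tempdir call and the File::Copy step are just my guess, untested):
use File::Temp qw(tempdir);
use File::Copy qw(copy);
use File::Spec;

# Create a scratch directory that is removed automatically at program exit.
my $tmpdir = tempdir( CLEANUP => 1 );

# Copy every .txt file into the scratch directory and work on the copies.
for my $file ( grep -f, <*.txt> ) {
    copy( $file, File::Spec->catfile( $tmpdir, $file ) )
        or die "Could not copy $file: $!";
}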
Does anybody have a good idea on this one?
I suspected that you didn't really want your original files modified when I answered your previous question.
I don't understand why you've gone back to accumulating all the text in a buffer before printing it, or why you've removed use strict, which is essential to any well-written Perl code.
Here's my previous solution modified to leave the input data untouched.
use strict;
use warnings;
use Tie::File;
my @files = grep -f, glob '*.txt';
my $all_filename = 'Trace.txt';
open my $out_fh, '>', $all_filename or die qq{Unable to open "$all_filename" for output: $!};
for my $i ( 0 .. $#files ) {
    my $file = $files[$i];
    next if $file eq $all_filename;
    print "Opening $file\n";
    tie my @lines, 'Tie::File', $file or die qq{Can't open "$file": $!};
    my ($start, $end) = (0, $#lines);
    ++$start unless $i == 0;
    --$end unless $i == $#files;
    print $out_fh "$_\n" for @lines[$start..$end];
}
close $out_fh;
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $outfile = 'Trace.txt';
# Get all files in cwd.
my @files = grep { -f && $_ ne $outfile } <*.txt>;
open my $outfh, '>', $outfile;
for my $file (@files) {
    # Localizing @ARGV lets the diamond operator slurp just this file.
    my @lines = do { local @ARGV = $file; <> };
    shift @lines unless $file eq $files[0];
    pop @lines unless $file eq $files[-1];
    print $outfh @lines;
}
Just do not use Tie::File. Or is there a reason you use it, for example that all your files together do not fit into memory?
A version very close to your current implementation would be something like the following (untested) code. It just skips the part where you update the file only to reopen and read it afterwards. (Note that this is certainly not a very efficient or overly elegant way to do this; it just sticks to your implementation as closely as possible.)
#!/usr/bin/perl
use warnings;
use Cwd;
# use Tie::File;
# use Tie::Array;
my $cwd = getcwd();
my $buff = '';
# Get all files in cwd.
my @files = grep ( -f , <*.txt> );
# Cut off header and footer of $files[1] to $files[$#files-1],
# but only footer of $files[0] and header of $files[$#files]
for (my $i = 0; $i <= $#files; $i++) {
    print 'Opening ' . $files[$i] . "\n";
    open (my $fh, "<", $files[$i]) or die "can't open $files[$i] for reading: $!";
    my @lines = <$fh>;
    splice @lines, 0, 1 unless $i == 0;
    splice @lines, -1, 1 unless $i == $#files;
    foreach my $line (@lines) {
        $buff .= $line;
    }
}
# Write the buffer to a new file.
my $allfilename = $cwd.'/Trace.txt';
print 'Writing all files into new file: ' . $allfilename . "\n";
open $outputfile, ">".$allfilename or die "can't write to new file $outputfile: $!";
# Write the buffer into the output file.
print $outputfile $buff;
close $outputfile;
Based on Miller's answer, but better suited to large files.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $outfile = 'Trace.txt';
# Get all files in cwd.
my @files = grep { -f && $_ ne $outfile } <*.txt>;
open my $outfh, '>', $outfile;
my $counter = 0;
for my $file (@files) {
    open my $fh, '<', $file;
    my ($line, $prev) = ('', '');
    my $l = 0;
    while ($line = <$fh>) {
        print $outfh $prev unless $l++ == 1 and $counter > 0;
        $prev = $line;
    }
    $counter++;
    print $outfh $prev if $counter == @files and $l > 0;
    close $fh;
}

Perl - search and replace across multiple lines across multiple files in specified directory

At the moment this code replaces all occurrences of my matching string with my replacement string, but only for the file I specify on the command line. Is there a way to change this so that all .txt files, for example, in the directory I specify are processed, without having to run this hundreds of times on individual files?
#!/usr/bin/perl
use warnings;
my $filename = $ARGV[0];
open(INFILE, "<", $filename) or die "Cannot open $ARGV[0]";
my(@fcont) = <INFILE>;
close INFILE;
open(FOUT,">$filename") || die("Cannot Open File");
foreach $line (@fcont) {
    $line =~ s/\<br\/\>\n([[:space:]][[:space:]][[:space:]][[:space:]][A-Z])/\n$1/gm;
    print FOUT $line;
}
close INFILE;
I have also tried this:
perl -p0007i -e 's/\<br\/\>\n([[:space:]][[:space:]][[:space:]][[:space:]][A-Z])/\n$1/m' *.txt
But I have noticed that it only changes the first occurrence of the matched pattern and ignores all the rest in the file.
I also have tried this, but it doesn't work in the sense that it just creates a blank file:
use v5.14;
use strict;
use warnings;
use DBI;
my $source_dir = "C:/Testing2";
# Store the handle in a variable.
opendir my $dirh, $source_dir or die "Unable to open directory: $!";
my @files = grep /\.txt$/i, readdir $dirh;
closedir $dirh;
# Stop script if there aren't any files in the list
die "No files found in $source_dir" unless @files;
foreach my $file (@files) {
    say "Processing $source_dir/$file";
    open my $in, '<', "$source_dir/$file" or die "Unable to open $source_dir/$file: $!\n";
    open(FOUT, ">$source_dir/$file") || die("Cannot Open File");
    foreach my $line (@files) {
        $line =~ s/\<br\/\>\n([[:space:]][[:space:]][[:space:]][[:space:]][A-Z])/\n$1/gm;
        print FOUT $line;
    }
    close $in;
}
say "Status: Processing of complete";
Just wondering what I am missing in my code above? Thanks.
You could try the following:
opendir(DIR, "your_directory");
my @all_files = readdir(DIR);
closedir(DIR);
for (@all_files) .....
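A rough sketch of how that loop might wrap the substitution from the question (untested; "your_directory" is a placeholder, and each file is slurped, rewritten, and overwritten in place):
use strict;
use warnings;

my $dir = 'your_directory';
opendir(DIR, $dir) or die "Cannot open $dir: $!";
my @all_files = grep /\.txt$/i, readdir(DIR);
closedir(DIR);

for my $name (@all_files) {
    my $path = "$dir/$name";

    # Slurp the whole file so the multi-line pattern can match across lines.
    open my $in, '<', $path or die "Cannot open $path: $!";
    my $text = do { local $/; <$in> };
    close $in;

    $text =~ s/\<br\/\>\n([[:space:]]{4}[A-Z])/\n$1/g;

    # Write the modified text back over the original file.
    open my $out, '>', $path or die "Cannot write $path: $!";
    print $out $text;
    close $out;
}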

Copy text after a specific string from a file and append to another in Perl

I want to extract the desired information from a file and append it to another. The first file starts with some header lines that follow no specific pattern and just end with the "END OF HEADER" string. I wrote the following code to find the matching line for the end of the header:
$find = "END OF HEADER";
open FILEHANDLE, $filename_path;
while (<FILEHANDLE>) {
    my $line = $_;
    if ($line =~ /$find/) {
        # ??? what shall I do here ???
    }
}
But I don't know how I can get the rest of the file and append it to the other file.
Thank you for any help.
I guess if the content of the file isn't enormous you can just load the whole file into a scalar, split it on "END OF HEADER", and then print the right-hand side of the split to the new file (appending):
open READHANDLE, 'readfile.txt' or die $!;
my $content = do { local $/; <READHANDLE> };
close READHANDLE;
my (undef,$restcontent) = split(/END OF HEADER/,$content);
open WRITEHANDLE, '>>writefile.txt' or die $!;
print WRITEHANDLE $restcontent;
close WRITEHANDLE;
This code will take the filenames from the command line, print everything up to END OF HEADER from the first file, followed by all lines from the second file. Note that the output is sent to STDOUT, so you will have to redirect it, like this:
perl program.pl headfile.txt mainfile.txt > newfile.txt
Update: Now modified to print all of the first file after the END OF HEADER line, followed by all of the second file.
use strict;
use warnings;
my ($header_file, $main_file) = @ARGV;
open my $fh, '<', $header_file or die $!;
my $print;
while (<$fh>) {
    print if $print;
    $print ||= /END OF HEADER/;
}
open $fh, '<', $main_file or die $!;
print while <$fh>;
use strict;
use warnings;
use File::Slurp;
my @lines = read_file('readfile.txt');
while ( my $line = shift @lines ) {
    next unless ($line =~ m/END OF HEADER/);
    last;
}
append_file('writefile.txt', @lines);
I believe this will do what you need:
use strict;
use warnings;
my $find = 'END OF HEADER';
my $fileContents;
{
    local $/;
    open my $fh_read, '<', 'theFile.txt' or die $!;
    $fileContents = <$fh_read>;
}
my ($restOfFile) = $fileContents =~ /$find(.+)/s;
open my $fh_write, '>>', 'theFileToAppend.txt' or die $!;
print $fh_write $restOfFile;
close $fh_write;
my $status = 0;
my $find = "END OF HEADER";
open my $fh_write, '>', $file_write
    or die "Can't open file $file_write $!";
open my $fh_read, '<', $file_read
    or die "Can't open file $file_read $!";
LINE:
while (my $line = <$fh_read>) {
    if ($line =~ /$find/) {
        $status = 1;
        next LINE;
    }
    print $fh_write $line if $status;
}
close $fh_read;
close $fh_write;