Adding header of the first file to all the other split files in Perl - perl

I need to add header of the first main file to all the split files. i.e I am able to get header for the 1st split file but i need it for all the split files, here I am splitting DAT file. Below is what i have done so for:
#!usr/bin/perl -w
my $chunksize = 25000000; # 25MB
my $filenumber = 0;
my $infile = "Test.dat";
my $outsize = 0;
my $eof = 0;
my $line = $_;
open INFILE, $infile;
open OUTFILE, ">outfile_".$filenumber.".dat";
while (<INFILE>) {
chomp;
if ($outsize > $chunksize) {
close OUTFILE;
$outsize = 0;
$filenumber++;
open (OUTFILE, ">outfile_".$filenumber.".dat")
or die "Can't open outfile_".$filenumber.".dat";
}
print OUTFILE "$_\n";
$outsize += length;
}
close INFILE;

You should always use warnings (in preference to the command-line -w) and use strict. That way many simple errors that you may otherwise have obverlooked will be flagged
Use the three-parameter form of open with lexical filehandles
Check the result of all open calls and flag errors containing the value of $! in a die string
Define constant values with the use constant pragma father than as Perl variables
The number of bytes printed to a filehandle can be evaluated using the tell function, so there is no need to keep your own count
To solve your specific problem, you should read and remember the first line of your input file, and print it to new output files every time they are opened
It is easier to keep track of the output files if you open them when you have new data to write and no open file, and close them when they are full or if you have reached the end of the input data
This program demonstrates the ideas and does what is required
use strict;
use warnings;
use constant INFILE => 'Test.dat';
use constant CHUNKSIZE => 25_000_000; # 25MB
open my $infh, '<', INFILE or die $!;
my $header = <$infh>;
my $outfh;
my $filenumber = 0;
while (my $line = <$infh>) {
unless ($outfh) {
my $outfile = "outfile_$filenumber.dat";
open $outfh, '>', $outfile or die "Can't open '$outfile': $!";
print { $outfh } $header;
$filenumber++;
}
print { $outfh } $line;
if (tell $outfh > CHUNKSIZE or eof $infh) {
close $outfh or die $!;
undef $outfh;
}
}

You need to store the header from the input file and print it every time a new file is opened:
use strict;
use warnings;
use autodie;
# initializations ...
open my $in, '<', $infile;
open my $out, '>', "outfile_${file_number}.dat";
my $header = <$in>; # Save the header...
chomp $header; # ... not strictly necessary
while ( <$in> ) {
chomp; # Not strictly necessary
if ( $outsize > $chunksize) {
close $out;
$outsize = 0;
$filenumber++;
open $out, '>', "outfile_${file_number}.dat";
print $out $header, "\n"; # Prints header at beginning of file
# Newline needed if $header chomped
}
print $out $_, "\n"; # Newline needed if $_ chomped
$outsize += length;
}

Related

How to use text output from one perl script as input to another perl script? I seem to be able run this as two separate scripts but not as one

I've got a script that reformats an input file and creates an output file. When I try to read that output file for the second part of the script, it doesn't work. However if I split the script into two parts it works fine and gives me the output that I need. I'm not a programmer and surprised I've got this far - I've been banging my head for days trying to resolve this.
My command for running it is this (BTW the temp.txt was just a brute force workaround for getting rid of the final comma to get my final output file - couldn't find another solution):
c:\perl\bin\perl merge.pl F146.sel temp.txt F146H.txt
Input looks like this (from another software package) ("F146.sel"):
/ Selected holes from the .\Mag_F146_Trimmed.gdb database.
"L12260"
"L12270"
"L12280"
"L12290"
Output looks like this (mods to the text: quotes removed, insert comma, concatenate into one line, remove the last comma) "F146H.txt":
L12260,L12270,L12280,L12290
Then I want to use this as input in the next part of the script, which basically inserts this output into a line of code that I can use in another software package (my "merge.gs" file). This is the output that I get if I split my script into two parts, but it just gives me a blank if I do it as one (see below).
CURRENT Database,"RAD_F146.gdb"
SETINI MERGLINE.OUT="DALL"
SETINI MERGLINE.LINES="L12260,L12270,L12280,L12290"
GX mergline.gx
What follows is my "merge.pl". What have I done wrong?
(actually, the question could be - what haven't I done wrong, as this is probably the most retarded code you've seen in a while. In fact, I bet some of you could get this entire operation done in 10-15 lines of code, instead of my butchered 90. Thanks in advance.)
# this reformats the SEL file to remove the first line and replace the " with nothing
$file = shift ;
$temp = shift ;
$linesH = shift ;
#open (Profiles, ">.\\scripts\\P2.gs")||die "couldn't open output .gs file";
open my $in, '<', $file or die "Can't read old file: Inappropriate I/O control operation";
open my $out, '>', $temp or die "Can't write new file: Inappropriate I/O control operation";
my $firstLine = 1;
while( <$in> )
{
if($firstLine)
{
$firstLine = 0;
}
else{
s/"L/L/g; # replace "L with L
s/"/,/g; # replace " with,
s|\s+||; # concatenates it all into one line
print $out $_;
}
}
close $out;
open (part1, "${temp}")||die "Couldn't open selection file";
open (part2, ">${linesH}")||die "Couldn't open selection file";
printitChomp();
sub printitChomp
{
print part2 <<ENDGS;
ENDGS
}
while ($temp = <part1> )
{
print $temp;
printit();
}
sub printit
{$string = substr (${temp}, 0,-1);
print part2 <<ENDGS;
$string
ENDGS
}
####Theoretically this creates the merge script from the output
####file from the previous loop. However it only seems to work
####if I split this into 2 perl scripts.
open (MergeScript, ">MergeScript.gs")||die "couldn't open output .gs file";
printitMerge();
open (SEL, "${linesH}")||die "Couldn't open selection file";
sub printitMerge
#open .sel file
{
print MergeScript <<ENDGS;
ENDGS
}
#iterate over required files
while ( $line = <SEL> ){
chomp $line;
print STDOUT $line;
printitLines();
}
sub printitLines
{
print MergeScript <<ENDGS;
CURRENT Database,"RAD_F146.gdb"
SETINI MERGLINE.OUT="DALL"
SETINI MERGLINE.LINES="${line}"
GX mergline.gx
ENDGS
}
so I think all you were really missing was close(part2); to allow it to be reopened as SEL..
#!/usr/bin/env perl
use strict;
use warnings;
# this reformats the SEL file to remove the first line and replace the " with nothing
my $file = shift;
my $temp = shift;
my $linesH = shift;
open my $in, '<', $file or die "Can't read old file: Inappropriate I/O control operation";
open my $out, '>', $temp or die "Can't write new file: Inappropriate I/O control operation";
my $firstLine = 1;
while (my $line = <$in>){
print "LINE: $line\n";
if ($firstLine){
$firstLine = 0;
} else {
$line =~ s/"L/L/g; # replace "L with L
$line =~ s/"/,/g; # replace " with,
$line =~ s/\s+//g; # concatenates it all into one line
print $out $line;
}
}
close $out;
open (part1, $temp) || die "Couldn't open selection file";
open (part2, ">", $linesH) || die "Couldn't open selection file";
while (my $temp_line = <part1>){
print "TEMPLINE: $temp_line\n";
my $string = substr($temp_line, 0, -1);
print part2 <<ENDGS;
$string
ENDGS
}
close(part2);
#### this creates the merge script from the output
#### file from the previous loop.
open (MergeScript, ">MergeScript.gs")||die "couldn't open output .gs file";
open (SEL, $linesH) || die "Couldn't open selection file";
#iterate over required files
while ( my $sel_line = <SEL> ){
chomp $sel_line;
print STDOUT $sel_line;
print MergeScript <<"ENDGS";
CURRENT Database,"RAD_F146.gdb"
SETINI MERGLINE.OUT="DALL"
SETINI MERGLINE.LINES="$sel_line"
GX mergline.gx
ENDGS
}
and one alternative way of doing it..
#!/usr/bin/env perl
use strict;
use warnings;
my $file = shift;
open my $in, '<', $file or die "Can't read old file: Inappropriate I/O control operation";
my #lines = <$in>; # read in all the lines
shift #lines; # discard the first line
my $line = join(',', #lines); # join the lines with commas
$line =~ s/[\r\n"]+//g; # remove the quotes and newlines
# print the line into the mergescript
open (MergeScript, ">MergeScript.gs")||die "couldn't open output .gs file";
print MergeScript <<"ENDGS";
CURRENT Database,"RAD_F146.gdb"
SETINI MERGLINE.OUT="DALL"
SETINI MERGLINE.LINES="$line"
GX mergline.gx
ENDGS

Merge txt files in Perl, but modify them before, leaving original files untouched

I've already posted a question and fixed the problem in my code, but now my "specification has changed" so to say, and now I need to change some things about it.
Here's a code that takes all .txt files from the current directory, cuts off the last line of the first file, the first and the last line of every following file and the first line of the last file and writes everything in a new file (in other words: merge all files, deleting header and footer so that the new file has only one header and one footer).
#!/usr/bin/perl
use warnings;
use Cwd;
use Tie::File;
use Tie::Array;
my $cwd = getcwd();
my $buff = '';
# Get all files in cwd.
my #files = grep ( -f ,<*.txt>);
# Cut off header and footer of $files [1] to $files[$#files-1],
# but only footer of $files[0] and header of $#files[$#files]
for (my $i = 0; $i <= $#files; $i++) {
print 'Opening ' . $files[$i] . "\n";
tie (#lines, Tie::File, $files[$i]) or die "can't update $file: $!";
splice #lines, 0, 1 unless $i == 0;
splice #lines, -1, 1 unless $i == $#files;
untie #lines;
open (file, "<", $files[$i]) or die "can't update $file: $!";
while (my $line =<file>) {
$buff .= $line;
}
close file;
}
# Write the buffer to a new file.
my $allfilename = $cwd.'/Trace.txt';
print 'Writing all files into new file: ' . $allfilename . "\n";
open $outputfile, ">".$allfilename or die "can't write to new file $outputfile: $!";
# Write the buffer into the output file.
print $outputfile $buff;
close $outputfile;
My problem: I don't want to change the original files, but my code does exactly that and I'm having trouble coming up with a solution. The simplest way (simple meaning not having to change too much code) would now be, to just copy all the files to a tmp directory, messing around with them and leaving the original files untouched. Problem: a simple use of dircopy doesn't do it for me, since you have to give a new tmp dir to the dircopy function, making the code only usable for Windows or UNIX systems (but I need portability).
The next approach would be to make use of the File::Temp module but I'm really having trouble with the docs on this one.
Does anybody have a good idea on this one?
I suspected that you didn't really want your original files modified when I answered your previous question.
I don't understand why you've gone back to accumulating all the text in a buffer before printing it, or why you've removed use strict, which is essential to any well-written Perl code.
Here's my previous solution modified to leave the input data untouched.
use strict;
use warnings;
use Tie::File;
my #files = grep -f, glob '*.txt';
my $all_filename = 'Trace.txt';
open my $out_fh, '>', $all_filename or die qq{Unable to open "$all_filename" for output: $!};
for my $i ( 0 .. $#files ) {
my $file = $files[$i];
next if $file eq $all_filename;
print "Opening $file\n";
tie my #lines, 'Tie::File', $file or die qq{Can't open "$file": $!};
my ($start, $end) = (0, $#lines);
++$start unless $i == 0;
--$end unless $i == $#files;
print $out_fh "$_\n" for #lines[$start..$end];
}
close $out_fh;
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $outfile = 'Trace.txt';
# Get all files in cwd.
my #files = grep { -f && $_ ne $outfile } <*.txt>;
open my $outfh, '>', $outfile;
for my $file (#files) {
my #lines = do { local #ARGV = $file; <> };
shift #lines unless $file eq $files[0];
pop #lines unless $file eq $files[-1];
print $outfh #lines;
}
Just do not use Tie::File. Or is there a reason you do this, for example all your files together do not fit your memory or something?
A version very close to your current implementation would be something like the following (untested) code. It just skips the part where you update the file, just to reopen and read it afterwards. (Note that this is certainly not a very effective or overly elegant way to do this, it just sticks to your implementation as close as possible)
#!/usr/bin/perl
use warnings;
use Cwd;
# use Tie::File;
# use Tie::Array;
my $cwd = getcwd();
my $buff = '';
# Get all files in cwd.
my #files = grep ( -f ,<*.txt>);
# Cut off header and footer of $files [1] to $files[$#files-1],
# but only footer of $files[0] and header of $#files[$#files]
for (my $i = 0; $i <= $#files; $i++) {
print 'Opening ' . $files[$i] . "\n";
open (my $fh, "<", $files[$i]) or die "can't open $file for reading: $!";
my #lines = <$fh>;
splice #lines, 0, 1 unless $i == 0;
splice #lines, -1, 1 unless $i == $#files;
foreach my $line (#lines) {
$buff .= $line;
}
}
# Write the buffer to a new file.
my $allfilename = $cwd.'/Trace.txt';
print 'Writing all files into new file: ' . $allfilename . "\n";
open $outputfile, ">".$allfilename or die "can't write to new file $outputfile: $!";
# Write the buffer into the output file.
print $outputfile $buff;
close $outputfile;
Based on Miller's answer, but most suitable for large files.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $outfile = 'Trace.txt';
# Get all files in cwd.
my #files = grep { -f && $_ ne $outfile } <*.txt>;
open my $outfh, '>', $outfile;
my $counter = 0;
for my $file (#files) {
open my $fh, '<', $file;
my ($line, $prev) = ('', '');
my $l = 0;
while ($line = <$fh>) {
print $outfh $prev unless $l++ == 1 and $counter > 0;
$prev = $line;
}
$counter++;
print $outfh $prev if $counter == #files and $l > 0;
close $fh;
}

how to count the number of specific characters through each line from file?

I'm trying to count the number of 'N's in a FASTA file which is:
>Header
AGGTTGGNNNTNNGNNTNGN
>Header2
AGNNNNNNNGNNGNNGNNGN
so in the end I want to get the count of number of 'N's and each header is a read so I want to make a histogram so I would at the end output something like this:
# of N's # of Reads
0 300
1 240
etc...
so there are 300 sequences or reads that have 0 number of 'N's
use strict;
use warnings;
my $file = shift;
my $output_file = shift;
my $line;
my $sequence;
my $length;
my $char_N_count = 0;
my #array;
my $count = 0;
if (!defined ($output_file)) {
die "USAGE: Input FASTA file\n";
}
open (IFH, "$file") or die "Cannot open input file$!\n";
open (OFH, ">$output_file") or die "Cannot open output file $!\n";
while($line = <IFH>) {
chomp $line;
next if $line =~ /^>/;
$sequence = $line;
#array = split ('', $sequence);
foreach my $element (#array) {
if ($element eq 'N') {
$char_N_count++;
}
}
print "$char_N_count\n";
}
Try this. I changed a few things like using scalar file handles. There are many ways to do this in Perl, so some people will have other ideas. In this case I used an array which may have gaps in it - another option is to store results in a hash and key by the count.
Edit: Just realised I'm not using $output_file, because I have no idea what you want to do with it :) Just change the 'print' at the end to 'print $out_fh' if your intent is to write to it.
use strict;
use warnings;
my $file = shift;
my $output_file = shift;
if (!defined ($output_file)) {
die "USAGE: $0 <input_file> <output_file>\n";
}
open (my $in_fh, '<', $file) or die "Cannot open input file '$file': $!\n";
open (my $out_fh, '>', $output_file) or die "Cannot open output file '$output_file': $!\n";
my #results = ();
while (my $line = <$in_fh>) {
next if $line =~ /^>/;
my $num_n = ($line =~ tr/N//);
$results[$num_n]++;
}
print "# of N's\t# of Reads\n";
for (my $i = 0; $i < scalar(#results) ; $i++) {
unless (defined($results[$i])) {
$results[$i] = 0;
# another option is to 'next' if you don't want to show the zero totals
}
print "$i\t\t$results[$i]\n";
}
close($in_fh);
close($out_fh);
exit;

copy text after a specific string from a file and append to another in perl

I want to extract the desired information from a file and append it into another. the first file consists of some lines as the header without a specific pattern and just ends with the "END OF HEADER" string. I wrote the following code for find the matching line for end of the header:
$find = "END OF HEADER";
open FILEHANDLE, $filename_path;
while (<FILEHANDLE>) {
my $line = $_;
if ($line =~ /$find/) {
#??? what shall I do here???
}
}
, but I don't know how can I get the rest of the file and append it to the other file.
Thank you for any help
I guess if the content of the file isn't enormous you can just load the whole file in a scalar and just split it with the "END OF HEADER" then print the output of the right side of the split in the new file (appending)
open READHANDLE, 'readfile.txt' or die $!;
my $content = do { local $/; <READHANDLE> };
close READHANDLE;
my (undef,$restcontent) = split(/END OF HEADER/,$content);
open WRITEHANDLE, '>>writefile.txt' or die $!;
print WRITEHANDLE $restcontent;
close WRITEHANDLE;
This code will take the filenames from the command line, print all files up to END OF HEADER from the first file, followed by all lines from the second file. Note that the output is sent to STDOUT so you will have to redirect the output, like this:
perl program.pl headfile.txt mainfile.txt > newfile.txt
Update Now modified to print all of the first file after the line END OF HEADER followed by all of the second file
use strict;
use warnings;
my ($header_file, $main_file) = #ARGV;
open my $fh, '<', $header_file or die $!;
my $print;
while (<$fh>) {
print if $print;
$print ||= /END OF HEADER/;
}
open $fh, '<', $main_file or die $!;
print while <$fh>;
use strict;
use warnings;
use File::Slurp;
my #lines = read_file('readfile.txt');
while ( my $line = shift #lines) {
next unless ($line =~ m/END OF HEADER/);
last;
}
append_file('writefile.txt', #lines);
I believe this will do what you need:
use strict;
use warnings;
my $find = 'END OF HEADER';
my $fileContents;
{
local $/;
open my $fh_read, '<', 'theFile.txt' or die $!;
$fileContents = <$fh_read>;
}
my ($restOfFile) = $fileContents =~ /$find(.+)/s;
open my $fh_write, '>>', 'theFileToAppend.txt' or die $!;
print $fh_write $restOfFile;
close $fh_write;
my $status = 0;
my $find = "END OF HEADER";
open my $fh_write, '>', $file_write
or die "Can't open file $file_write $!";
open my $fh_read, '<', $file_read
or die "Can't open file $file_read $!";
LINE:
while (my $line = <$fh_read>) {
if ($line =~ /$find/) {
$status = 1;
next LINE;
}
print $fh_write $line if $status;
}
close $fh_read;
close $fh_write;

How can I delete the last 10 lines of a file in perl

I am taking a total number of line as a user input and then I am deleting those numbers of l ine from the file.
I saw this learn.perl.org/faq/perlfaq5.html#How-do-I-count-the-number-of-lines-in-a-file- and then I tired the below simple logic.
Logic:
Get the Total number of lines
Subtracts it by the numbers entered by user
print the lines
Here is my code :
#!/usr/bin/perl -w
use strict;
open IN, "<", "Delete_line.txt"
or die " Can not open the file $!";
open OUT, ">", "Update_delete_line.txt"
or die "Can not write in the file $!";
my ($total_line, $line, $number, $printed_line);
print"Enter the number of line to be delete\n";
$number = <STDIN>;
while ($line = <IN>) {
$total_line = $.; # Total number of line in the file
}
$printed_line = $total_line - $number;
while ($line = <IN>) {
print OUT $line unless $.== $printed_line;
}
Well, neither i am getting any error in code nor any out put ? why I just don't know.
Can any one give me some suggestion.
A Perl solution that's efficient for large files requires the use of File::ReadBackwards
use File::ReadBackwards qw( );
my $num_lines = 10;
my $qfn = 'file.txt';
my $pos = do {
my $fh = File::ReadBackwards->new($qfn)
or die $!;
$fh->readline() for 1..$num_lines;
$fh->tell()
};
truncate($qfn, $pos)
or die $!;
This does not read the whole file twice (unlike the OP's method).
This does not read the whole file (unlike the Tie::File solutions).
This does not read the whole file into memory.
Yet another way is to use Tie::File
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::File;
tie my #lines, 'Tie::File', 'myfile' or die "$!\n";
$#lines -= 10;
untie #lines;
This has the advantage of not loading the file into memory while acting like it does.
Here a solution that passes through a stream and prints all but the last n lines where n is a command line argument:
#!/usr/bin/perl
my #cache;
my $n = shift #ARGV;
while(<>) {
push #cache, $_;
print shift #cache if #cache > $n;
}
or the one-liner version:
perl -ne'BEGIN{$n=shift#ARGV}push#c,$_;print shift#c if#c>$n' NUMBER
After finishing reading from IN, you have to reopen it or seek IN, 0, 0 to reset its position. You also have to set $. to zero again.
Also, the final condition should be changed to unless $. > $printed_line so you skip all the lines over the threshold.
The "more fun" answer: use Tie::File!
use strict;
use warnings;
use Tie::File;
tie my #file, 'Tie::File', 'filename' or die "$!";
$#file -= 10;
Just read the file in reverse and delete the first n lines: -
open my $filehandle, "<", "info.txt";
my #file = <$filehandle>;
splice(#file, -10);
print #file;
Note: This loads the entire file into memory.
You could just buffer the last 10 lines and then not print out the remaining 10.
use English qw<$INPLACE_EDIT>;
{ local #ARGV = $name_of_file_to_edit;
local $INPLACE_EDIT = '.bak';
my #buffer;
for ( 1..$num_lines_to_trim ) {
push #buffer, <>;
}
while ( <> ) {
print shift #buffer;
push #buffer, $_;
}
}
You could also do this with File::Slurp::edit_file_lines:
my #buffer;
my $limit_reached = 0;
edit_file_lines {
push #buffer, $_;
return ( $limit_reached ||= #buffer > $num_lines_to_trim ) ? shift #buffer
: ''
;
} $name_of_file;
my $num_lines = 10;
my $qfn = 'file.txt';
system('head', '-n', -$num_lines, '--', $qfn);
die "Error" if $?;
Easy with a C like for :
#!/usr/bin/perl -w
use strict;
open(my $in,"<","Delete_line.txt") or die "Can not open the file $!";
open(my $out,">","Update_delete_line.txt") or die"Can not write in the file $!";
print"Enter the number of lines to be delete\n";
my $number=<STDIN>;
my #file = <$in>;
for (my $i = 0; $i < $#file - $number + 1; $i++) {
print $out $file[$i];
}
close $in;
close $out;
#
# Reads a file trims the top and the bottom of by passed num of lines
# and return the string
# stolen from : http://stackoverflow.com/a/9330343/65706
# usage :
# my $StrCatFile = $objFileHandler->ReadFileReturnTrimmedStrAtTopBottom (
# $FileToCat , $NumOfRowsToRemoveAtTop , $NumOfRowsToRemoveAtBottom) ;
sub ReadFileReturnTrimmedStrAtTopBottom {
my $self = shift ;
my $file = shift ;
my $NumOfLinesToRemoveAtTop = shift ;
my $NumOfLinesToRemoveAtBottom = shift ;
my #cache ;
my $StrTmp = () ;
my $StrReturn = () ;
my $fh = () ;
open($fh, "<", "$file") or cluck ( "can't open file : $file for reading: $!" ) ;
my $counter = 0;
while (<$fh>) {
if ($. >= $NumOfLinesToRemoveAtTop + 1) {
$StrTmp .= $_ ;
}
}
close $fh;
my $sh = () ;
open( $sh, "<", \$StrTmp) or cluck( "can't open string : $StrTmp for reading: $!" ) ;
while(<$sh>) {
push ( #cache, $_ ) ;
$StrReturn .= shift #cache if #cache > $NumOfLinesToRemoveAtBottom;
}
close $sh ;
return $StrReturn ;
}
#eof ReadFileReturnTrimmedStrAtTopBottom
#