Ignore the first two lines with ## in perl - perl

Hi all.
I'm a newbie in programming, especially in Perl. I would like to skip the first two lines in my dataset.
This is my code:
while (<PEPTIDELIST>) {
next if $_ !=~ "##";
chomp $_;
@data = split /\t/;
chomp $_;
next if /Sequence/;
chomp $_;
$npeptides++;
# print "debug: 0: $data[0] 1: $data[1] 2: $data[2] 3: $data[3] \n" if ( $debug );
my $pepseq = $data[1];
#print $pepseq."\n";
foreach my $header (keys %sequence) {
#print "looking for $pepseq in $header \n";
if ($sequence{$header} =~ /$pepseq/ ) {
print "matched $pepseq in protein $header" if ( $debug );
# my $in =<STDIN>;
if ( $header =~ /(ENSGALP\S+)\s.+(ENSGALG\S+)/ ) {
print "debug: $1 $2 have the pep = $pepseq \n\n" if ( $debug );
my $lprot = $1;
my $lgene = $2;
$gccount{$lgene}++;
$pccount{$lprot}++;
# print "$1" if($debug);
# print "$2" if ($debug);
print OUT "$pepseq,$1,$2\n";
}
}
}
my $ngenes = keys %gccount;
my $nprots = keys %pccount;
Somehow the peptide is not in the output list. Please help point out where it goes wrong.
Thanks.

If you want to skip lines that contain ## anywhere in them:
next if /##/;
If you only want to skip lines that start with ##:
next if /^##/;
If you always want to skip the first two lines, regardless of content:
next if $. < 3;
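The three variants can be compared side by side. This is only a minimal sketch: it reads from a small in-memory sample instead of the PEPTIDELIST handle from the question, and the sub name kept_lines, the mode names, and the sample rows are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $text that survive a skip rule.
# $mode is 'anywhere', 'start' or 'first_two' (names invented for this sketch).
sub kept_lines {
    my ($text, $mode) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @kept;
    while (my $line = <$fh>) {
        chomp $line;
        next if $mode eq 'anywhere'  && $line =~ /##/;   # ## anywhere in the line
        next if $mode eq 'start'     && $line =~ /^##/;  # ## at the start only
        next if $mode eq 'first_two' && $. < 3;          # lines 1 and 2, whatever they contain
        push @kept, $line;
    }
    close $fh;
    return @kept;
}

my $sample = "## generated by tool\n## version 2\nSequence\tScore\nPEP##TIDE\t1\nPEPTIDE\t2\n";
print join('|', kept_lines($sample, $_)), "\n" for qw(anywhere start first_two);
```

Note that the 'anywhere' rule also drops the data row that merely contains ##, which is usually not what you want for header lines.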

next if $_ !=~ "##"; must be next if $_ =~ "##";
This ignores the line if $_ matches ##.
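The reason the original condition misbehaves is that Perl has no !=~ operator. As far as I know it is parsed as the numeric != followed by the unary ~ (bitwise negation), so no regex match takes place at all. The negated binding operator is !~. A minimal sketch of the two real operators; the sub names and sample lines are invented for illustration:

```perl
use strict;
use warnings;

# =~ is true when the pattern matches, !~ is true when it does not.
sub is_comment     { my ($line) = @_; return $line =~ /^##/ ? 1 : 0 }
sub is_not_comment { my ($line) = @_; return $line !~ /^##/ ? 1 : 0 }

for my $line ('## header', 'PEPTIDE 42') {
    printf "%-12s comment=%d not_comment=%d\n",
        $line, is_comment($line), is_not_comment($line);
}
```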

Related

Perl parse CSV file "fill" and "null" fields

Okay - I'm going to post my entire script since I get chastised when I don't do it - even though, last time I did that I got chastised for posting the whole script. I simply need to know if the one line I originally asked about would work. ENTIRE SCRIPT (which was working just fine until the other dept gave me their data entirely differently than what we were originally told it would be) TO FOLLOW AT THE END
I'm parsing through and scrubbing a CSV file to make it ready to be loaded in a MySQL table. It is loaded through the table via someone else's "batch Java program" and if any field is empty the batch file stops with an error.
I've been told to just put in a blank space whenever there's an empty field in any record. Would something as simple as this work?
if ( ! length $fields[2] ) {
$_ = ' ' for $fields[2];
}
And would there be a way to check several fields at once? Or, what might be better, to check ALL the fields (after the record has been split) as the last thing I do just before writing the record back out to the CSV file?
Here's the entire script. Please don't tell me how what I'm doing within the already working script is not how you would do it. -
#!/usr/bin/perl/
use strict;
use warnings;
use Data::Dumper;
use Time::Piece;
my $filename = 'mistints_1505_comma.csv';
#my $filename = 'test.csv';
# Open input file
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
# Open error handling file
open ( my $ERR_FH, '>', "errorFiles1505.csv" ) or die $!;
# Read the header line of the input file and print to screen.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
print Dumper(@fields), $/;
my @data;
# Read the lines one by one.
while($line = <$FH>) {
chomp($line);
# Scrub data of characters that cause scripting problems down the line.
$line =~ s/[\'\\]/ /g;
# split the fields of each record
my @fields = split(/,/, $line);
# Check if the storeNbr field is empty. If so, write record to error file.
if (!length $fields[28]) {
chomp (@fields);
my $str = join ',', @fields;
print $ERR_FH "$str\n";
}
else
{
# Concatenate the first three fields and add to the beginning of each record
unshift @fields, join '_', @fields[28..30];
# Format the DATE fields for MySQL
$_ = join '-', (split /\//)[2,0,1] for @fields[10,14,24,26];
# Scrub colons from the data
$line =~ s/:/ /g;
# If Spectro_Model is "UNKNOWN", change
if($fields[22] eq "UNKNOWN"){
$_ = 'UNKNOW' for $fields[22];
}
# If tran_date is blank, insert 0000-00-00
if(!length $fields[10]){
$_ = '0000-00-00' for $fields[10];
}
# If init_tran_date is blank, insert 0000-00-00
if(!length $fields[14]){
$_ = '0000-00-00' for $fields[14];
}
# If update_tran_date is blank, insert 0000-00-00
if(!length $fields[24]){
$_ = '0000-00-00' for $fields[24];
}
# If cancel_date is blank, insert 0000-00-00
if(!length $fields[26]){
$_ = '0000-00-00' for $fields[26];
}
# Format the PROD_NBR field by deleting any leading zeros before decimals.
$fields[12] =~ s/^\s*0\././;
# put the records back
push @data, \@fields;
}
}
close $FH;
close $ERR_FH;
print "Unsorted:\n", Dumper(@data); #, $/;
#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
@data = sort {
$a->[0] cmp $b->[0] ||
$a->[14] cmp $b->[14] ||
$a->[26] cmp $b->[26] ||
$a->[27] cmp $b-> [27]
} @data;
#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1505.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;
exit;
As far as I can tell you have split a record on commas ,, and you want to alter all fields that are empty strings to contain a single space
I would write this
use strict;
use warnings 'all';
my $record = 'a,b,c,,e,,g,,i,,k,,m,n,o,p,q,r,s,t';
my @fields = map { $_ eq "" ? ' ' : $_ } split /,/, $record;
use Data::Dump;
dd \@fields;
output
[ "a", "b", "c", " ", "e", " ", "g", " ", "i", " ", "k", " ", "m" .. "t" ]
Alternatively, if you have some fields that need to be set to something different if they are empty, you can set up an array of defaults
That would look like this. All of the @defaults array is set to spaces except for fields 10, 11 and 12, which are 0000-00-00. These are picked up after the record is split
use strict;
use warnings 'all';
my @defaults = (' ') x 20;
$defaults[$_] = '0000-00-00' for 9, 10, 11;
my $record = 'a,b,c,,e,,g,,i,,k,,m,n,o,p,q,r,s,t';
my @fields = split /,/, $record;
for my $i ( 0 .. $#fields ) {
$fields[$i] = $defaults[$i] if $fields[$i] eq '';
}
use Data::Dump;
dd \@fields;
output
[ "a", "b", "c", " ", "e", " ", "g", " ", "i", "0000-00-00", "k", "0000-00-00", "m" .. "t" ]
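One caveat applies to both snippets: by default split discards trailing empty fields, so a record that ends in one or more commas comes back shorter than expected and the empty columns at the end are never padded. Passing a negative limit keeps them. A minimal sketch, where pad_fields is a name made up for this example and the record is assumed to contain no quoted commas:

```perl
use strict;
use warnings;

# Split a comma-separated record and replace every empty field with one space.
# The limit of -1 tells split to keep trailing empty fields.
sub pad_fields {
    my ($record) = @_;
    return map { $_ eq '' ? ' ' : $_ } split /,/, $record, -1;
}

my @fields = pad_fields('a,,c,,');
print scalar(@fields), " fields: [", join('][', @fields), "]\n";
```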
Having seen your full program, I recommend something like this. If you had shown a sample of your input data then I could have used a hash to refer to column names instead of numbers, making it much more readable
#!/usr/bin/perl
use strict;
use warnings 'all';
use Data::Dumper;
use Time::Piece;
my $filename = 'mistints_1505_comma.csv';
#my $filename = 'test.csv';
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
open( my $ERR_FH, '>', "errorFiles1505.csv" ) or die $!;
chomp( my $line = <$FH> );
my @fields = split /,/, $line; #/
print Dumper( \@fields ), "\n";
my @data;
# Read the lines one by one.
while ( <$FH> ) {
chomp;
# Scrub data of characters that cause scripting problems down the line.
tr/'\\/ /; #'
my @fields = split /,/; #/
# Check if the storeNbr field is empty. If so, write record to error file.
if ( $fields[28] eq "" ) {
my $str = join ',', @fields;
print $ERR_FH "$str\n";
next;
}
# Concatenate the first three fields and add to the beginning of each record
unshift @fields, join '_', @fields[ 28 .. 30 ];
# Format the DATE fields for MySQL
$_ = join '-', ( split /\// )[ 2, 0, 1 ] for @fields[ 10, 14, 24, 26 ];
# Scrub colons from the data
tr/://d; #/
my $i = 0;
for ( @fields ) {
# If "Spectro_Model" is "UNKNOWN" then change to "UNKNOW"
if ( $i == 22 ) {
$_ = 'UNKNOW' if $_ eq 'UNKNOWN';
}
# If a date field is blank then insert 0000-00-00
elsif ( grep { $i == $_ } 10, 14, 24, 26 ) {
$_ = '0000-00-00' if $_ eq "";
}
# Format the PROD_NBR field by deleting any leading zeros before decimals.
elsif ( $i == 12 ) {
s/^\s*0\././;
}
# Change all remaining empty fields to a single space
else {
$_ = ' ' if $_ eq "";
}
++$i;
}
push @data, \@fields;
}
close $FH;
close $ERR_FH;
print "Unsorted:\n", Dumper(@data); #, $/;
#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
@data = sort {
$a->[0] cmp $b->[0] or
$a->[14] cmp $b->[14] or
$a->[26] cmp $b->[26] or
$a->[27] cmp $b->[27]
} @data;
#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1505.csv' or die $!;
print $OFH join(',', @$_), $/ for @data;
close $OFH;
Well if you did it before splitting into $fields, you ought to be able to do something like
# assuming a CSV line is in $_
#pad null at start of line
s/^,/ ,/;
#pad nulls in the middle
s/,,/, ,/g;
#pad null at the end
s/,$/, /;
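Be aware that s/,,/, ,/g does not handle runs of three or more commas: each match consumes both commas and the search resumes after them, so a,,,b becomes a, ,,b and one field stays empty. A lookahead avoids consuming the second comma. This is a minimal sketch under the same assumption as above (no quoted fields that contain commas); pad_line is a hypothetical name:

```perl
use strict;
use warnings;

sub pad_line {
    my ($line) = @_;
    $line =~ s/^(?=,)/ /;      # empty first field
    $line =~ s/,(?=,)/, /g;    # empty middle fields; the lookahead leaves the next comma available
    $line =~ s/,$/, /;         # empty last field
    return $line;
}

print pad_line(',a,,,b,'), "\n";
```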
Don't try to roll out your own CSV parsing code. Use Text::CSV or Text::CSV::Slurp.
With Text::CSV you could do something like
$line = $csv->string(); # get the combined string
$status = $csv->parse($line); # parse a CSV string into fields
@columns = map { defined $_ && length $_ ? $_ : " " } $csv->fields(); # get the parsed fields, padding undefined or empty ones
Are you really sure you want to replace nulls with spaces? I'd say if the field is undefined it should be NULL in db.

cant retrieve values from hash reversal (Perl)

I've initialized a hash with Names and their class ranking as follows
a=>5,b=>2,c=>1,d=>3,e=>5
I've this code so far
my %Ranks = reverse %Class; #As I need to find out who's ranked first
print "\nFirst place goes to.... ", $Ranks{1};
The code only prints out
"First place goes to...."
I want it to print out
First place goes to....c
Could you tell me where I'm going wrong here?
The class hash prints correctly,
but if I try to print the reversed hash using
foreach $t (keys %Ranks) {
print "\n $t $Ranks{$t}"; }
It prints
5
abc23
cab2
ord
If this helps in any way
FULL CODE
#Script to read from the data file and initialize it into a hash
my %Code;
my %Ranks;
#Check whether the file exists
open(fh, "Task1.txt") or die "The File Does Not Exist!\n", $!;
while (my $line = <fh>) {
chomp $line;
my @fields = split /,/, $line;
$Code{$fields[0]} = $fields[1];
$Class{$fields[0]} = $fields[2];
}
close(fh);
#Prints the dataset
print "Code \t Name\n";
foreach $code ( keys %Code) {
print "$code \t $Code{$code}\n";
}
#Find out who comes first
my %Ranks = reverse %Class;
foreach $t (keys %Ranks)
{
print "\n $t $Ranks{$t}";
}
print "\nFirst place goes to.... ", $Ranks{1}, "\n";
When you want to check what your data structures actually contain, use Data::Dumper. use Data::Dumper; local $Data::Dumper::Useqq = 1; print(Dumper(\%Class));. You'll find un-chomped newlines.
You need to use chomp. At present your $fields[2] value has a trailing newline.
Change your file read loop to this
while (my $line = <fh>) {
chomp $line;
my @fields = split /,/, $line;
$Code{$fields[0]} = $fields[1];
$Class{$fields[0]} = $fields[2];
}
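Two things are worth checking whenever a lookup in a reversed hash comes back empty: stray whitespace in the values, and duplicate values (in the sample data both a and e have rank 5, and a reversed hash can keep only one of them). A minimal sketch with the data from the question; the deliberately un-chomped value for c is an assumption made to reproduce the symptom:

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Useqq    = 1;   # show \n and other control characters escaped
local $Data::Dumper::Sortkeys = 1;   # stable key order in the dump

my %Class = (a => 5, b => 2, c => "1\n", d => 3, e => 5);
print Dumper(\%Class);               # the newline after 1 is visible as \n

chomp(my @clean = %Class);           # chomp works on every element of a list
my %Ranks = reverse @clean;          # rank => name; duplicate ranks collapse to one name
print "First place goes to.... $Ranks{1}\n";
```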

basic regex and string manipulation for DNA analysis using perl

I am new to perl and would like to do what I think is some basic string manipulation to DNA sequences stored in an rtf file.
Essentially, my file reads (file is in FASTA format):
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
What I would like to do is read into my file and print the header (header is >LM1) then match the following DNA sequence GTGCCAGCAGCCGC and then print the preceding DNA sequence.
So my output would look like this:
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
I have written the following program:
#!/usr/bin/perl
use strict; use warnings;
open(FASTA, "<seq_V3_V6_130227.rtf") or die "The file could not be found.\n";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
my $dna = <FASTA>;
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$dna";
}
}
close(FASTA);
The problem is that my program reads the file line by line and the output I am receiving is the following:
>LM1
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC
Basically I don't know how to assign the entire DNA sequence to my $dna variable and ultimately don't know how to avoid reading the DNA sequence line by line. Also I am getting this warning:
Use of uninitialized value $dna in pattern match (m//) at stacked.pl line 14, line 1113.
If anyone could give me some help with writing better code or point me in the correct direction it would be much appreciated.
Using the pos function:
use strict;
use warnings;
my $dna = "";
my $seq = "GTGCCAGCAGCCGC";
while (<DATA>) {
if (/^>/) {
print;
} else {
if (/^[AGCT]/) {
$dna .= $_;
}
}
}
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
__DATA__
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
You can process a file with multiple entries like so:
while (<DATA>) {
if (/^>/) {
if ($dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
$dna = ""; # start the next entry empty, whether or not the previous one matched
print;
} elsif (/^[AGCT]/) {
$dna .= $_;
}
}
if ($dna && $dna =~ /$seq/g) {
print substr($dna, 0, pos($dna) - length($seq)), "\n";
}
Your while statement reads until the end of file. That means at every loop iteration, $_ is the next line in <FASTA>. So $dna = <FASTA> isn't doing what you think it is. It is reading more than you probably want it to.
while(<FASTA>) { #Reads a line here
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
}
my $dna = <FASTA>; # reads another line here, so every other line is skipped
}
Now, you need to read the sequence into your $dna. You can update your while loop with an else statement: if it's a header line, print it; otherwise, add it to $dna.
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
After the loop, you can do your regex.
Note: This solution assumes there is only 1 sequence in the fasta file. If you have more than one, your $dna variable will have all the sequences as one.
Edit: Adding a simple way to handle multiple sequences
my $dna = "";
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
# Does $dna match the regex?
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
# Reset the sequence
$dna = "";
# It is a header line, so print it
my $header = $_;
print "$header\n";
} else {
# if it is not a header line, add to your dna sequence.
$dna .= $_;
}
}
# Check the last sequence
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print "$1\n";
}
I came up with a solution using Bio::SeqIO (and the trunc method from Bio::Seq) from the BioPerl distribution. I also used index to find the subsequence rather than using a regular expression.
This solution does not print out the id (the line beginning with >) if the subsequence was not found or if the subsequence begins at the first position (and thus has no preceding characters).
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => "fasta_junk.fasta" ,
-format => 'fasta');
my $out = Bio::SeqIO->new( -file => '>test.dat',
-format => 'fasta');
my $lookup = 'GTGCCAGCAGCCGC';
while ( my $seq = $in->next_seq() ) {
my $pos = index $seq->seq, $lookup;
# $pos == -1 means $lookup was not found;
# $pos == 0 means $lookup starts at the first position, so there are
# no preceding characters. Only $pos > 0 is of interest.
if ($pos > 0) {
my $trunc = $seq->trunc(1,$pos);
$out->write_seq($trunc);
}
}
__END__
*** fasta_junk.fasta
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT
*** contents of test.dat
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAAAGTACTGTCC
GTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTTGACGGTATCTAACCAGAAAG
CCACGGCTAACTAC
Read the whole file into memory, then look for the regexp:
while(<FASTA>) {
chomp($_);
if ($_ =~ m/^>/ ) {
my $header = $_;
print "$header\n";
} else {
$dna .= $_;
}
}
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
print $1;
}
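For files with several records, the same accumulate-then-search idea can be wrapped in a function that uses index instead of a regex. This is a minimal sketch in core Perl, not a replacement for BioPerl; the name prefixes_before and the tiny sample data are made up, and lines are joined without newlines so that a motif spanning a line break is still found.

```perl
use strict;
use warnings;

# Return [header, prefix] pairs: for each FASTA record, the part of the
# sequence that precedes the first occurrence of $motif.
# Records where the motif is absent or starts at position 0 are left out.
sub prefixes_before {
    my ($fh, $motif) = @_;
    my (@found, $header, $dna);

    my $flush = sub {
        return unless defined $header;
        my $pos = index $dna, $motif;
        push @found, [$header, substr($dna, 0, $pos)] if $pos > 0;
    };

    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;              # strips \n as well as a stray \r
        if ($line =~ /^>/) {
            $flush->();                  # finish the previous record, if any
            ($header, $dna) = ($line, '');
        }
        elsif (defined $header) {
            $dna .= $line;               # joined without newlines
        }
    }
    $flush->();                          # the last record has no following header
    return @found;
}

my $fasta = ">LM1\nAAGTC\nTGGTG\nCCAGC\n>LM2\nTTTT\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
print "$_->[0]\n$_->[1]\n" for prefixes_before($fh, 'GTGCC');
```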

perl "eq" doesn't work well. I can't find my fault

I'm writing a small server with Perl. There is a small problem. When the client gives me a sentence like this "op:xxx:xxx:xxx", I'll get op, then do things depending on what op is. It works well if the op is adduser and so on. (I use if $op eq "adduser"...)
But when I get a "getList:xxx:xxx" and $op is getList, it doesn't pass a test like "if $op eq "getList"". I know it must be my mistake, but I just can't find it.
Thank you, everyone.
use warnings;
use strict;
package MyPackage;
use base qw(Net::Server);
our %data_base;
our %tag_base;
sub list {
my %resault;
foreach ( keys %tag_base) {
print STDERR $_ . "1";
my @tags = split /:/, $tag_base{$_};
foreach ( @tags) {
$resault{$_} ++;
}
}
my @tags;
foreach ( keys %resault) {
push @tags, "$_,$resault{$_}";
}
$_ = join ";", @tags;
print ;
print STDERR ;
}
sub users {
my $topic = shift;
my @users;
foreach ( keys %tag_base) {
push @users, $_ if $tag_base{$_} =~ /$topic/;
}
$_ = join ";", @users;
print ;
}
sub process_request {
my $self = shift;
my $person;
my @info;
while (<STDIN>) {
my @gets = split /:/, $_;
print STDERR "@gets\n";
# $data_base{shift @person} = join ":", @person;
my $op = shift @gets;
$op =~ s/\s//;
print STDERR $op . "\n";
if ( $op eq "adduser") {
my $user_name = shift @gets;
if ( exists $data_base{$user_name}) {
print "already_exist";
} else {
$data_base{$user_name} = join ":", @gets;
print "addUserSu";
}
} elsif ( $op eq "login") {
my $login_name = shift @gets;
my $login_pw = shift @gets;
if ( defined $data_base{$login_name}) {
$person = $data_base{$login_name};
@info = split /:/, $person;
$info[0] =~ s/\s+//;
if ($login_pw eq $info[0]) {
print "$person";
} else {
print "/$info[0]/";
}
} else {
print "unexist_user";
}
} elsif ( $op eq "addTag") {
my $tag_user = shift @gets;
$tag_base{$tag_user} = join ":", @gets;
print "addTagSu";
} elsif ( $op eq "getList") {
print STDERR "right";
&list;
} elsif ( $op eq "getUsers") {
&users;
}
}
}
MyPackage->run(port => 13800);
I can see two (simple) reasons this might fail:
$op =~ s/\s//;
You only remove one whitespace: The first one. If your intention is to strip all whitespace, you'd want s/\s+//g.
And second:
Random capital letters in strings, variable names and commands is Evil. eq is case sensitive, so if $op is "getlist", then if ($op eq "getList") will be false. Unless capitalization is important to you, you could do if (lc($op) eq "getlist").
Without sample input, expected output and actual output, this is however nothing more than guesswork.
Also, as a debug statement, this is useless:
print STDERR $op . "\n";
That is easily confused and overlooked. For example, if $op is empty, it just produces a blank line in your error log. Use:
print STDERR "OP is: '$op'\n";
Now you will be able to identify the line where $op should appear, and you will be more easily see whitespace surrounding it.
You are reading strings without chomping them.
i.e.
When you run your code :
addtag:fred:barney
The input is stored as fred => "barney\n"
when you getList, the output is :
barney
,1;
I suspect the client is expecting 1 line of output that reads :
barney,1;
So, just add a chomp in your code here :
while (<STDIN>) {
chomp;
my @gets = split /:/, $_;
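A sketch of how the request line could be split so that neither the line ending nor surrounding spaces end up in the operation name. parse_request is a hypothetical helper written for this example, not part of Net::Server:

```perl
use strict;
use warnings;

# Split "op:arg1:arg2\n" into the operation and its arguments.
sub parse_request {
    my ($raw) = @_;
    $raw =~ s/\R\z//;                       # drop the line ending (\n or \r\n); needs Perl 5.10+
    my ($op, @args) = split /:/, $raw, -1;  # -1 keeps empty trailing arguments
    $op = '' unless defined $op;            # an empty request yields an empty operation
    $op =~ s/^\s+|\s+$//g;                  # trim both ends of the operation name
    return ($op, @args);
}

my ($op, @args) = parse_request(" getList:music:perl\r\n");
print STDERR "OP is: '$op', args: '@args'\n";
print "matched\n" if $op eq 'getList';
```

Quoting the value in the debug output, as suggested above, makes any leftover whitespace visible immediately.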

Cleanest Perl parser for Makefile-like continuation lines

A perl script I'm writing needs to parse a file that has continuation lines like a Makefile. i.e. lines that begin with whitespace are part of the previous line.
I wrote the code below but don't feel like it is very clean or perl-ish (heck, it doesn't even use "redo"!)
There are many edge cases: EOF at odd places, single-line files, files that start or end with a blank line (or non-blank line, or continuation line), empty files. All my test cases (and code) are here: http://whatexit.org/tal/flatten.tar
Can you write cleaner, perl-ish, code that passes all my tests?
#!/usr/bin/perl -w
use strict;
sub process_file_with_continuations {
my $processref = shift @_;
my $nextline;
my $line = <ARGV>;
$line = '' unless defined $line;
chomp $line;
while (defined($nextline = <ARGV>)) {
chomp $nextline;
next if $nextline =~ /^\s*#/; # skip comments
$nextline =~ s/\s+$//g; # remove trailing whitespace
if (eof()) { # Handle EOF
$nextline =~ s/^\s+/ /;
if ($nextline =~ /^\s+/) { # indented line
&$processref($line . $nextline);
}
else {
&$processref($line);
&$processref($nextline) if $nextline ne '';
}
$line = '';
}
elsif ($nextline eq '') { # blank line
&$processref($line);
$line = '';
}
elsif ($nextline =~ /^\s+/) { # indented line
$nextline =~ s/^\s+/ /;
$line .= $nextline;
}
else { # non-indented line
&$processref($line) unless $line eq '';
$line = $nextline;
}
}
&$processref($line) unless $line eq '';
}
sub process_one_line {
my $line = shift @_;
print "$line\n";
}
process_file_with_continuations \&process_one_line;
How about slurping the whole file into memory and processing it using regular expressions. Much more 'perlish'. This passes your tests and is much smaller and neater:
#!/usr/bin/perl
use strict;
use warnings;
$/ = undef; # we want no input record separator.
my $file = <>; # slurp whole file
$file =~ s/^\n//; # Remove newline at start of file
$file =~ s/\s+\n/\n/g; # Remove trailing whitespace.
$file =~ s/\n\s*#[^\n]+//g; # Remove comments.
$file =~ s/\n\s+/ /g; # Merge continuations
# Done
print $file;
If you don't mind loading the entire file in memory, then the code below passes the tests.
It stores the lines in an array, adding each line either to the previous one (continuation) or at the end of the array (other).
#!/usr/bin/perl
use strict;
use warnings;
my @out;
while( <>)
{ chomp;
s{#.*}{}; # suppress comments
next unless( m{\S}); # skip blank lines
if( s{^\s+}{ }) # does the line start with spaces?
{ $out[-1] .= $_; } # yes, continuation, add to last line
else
{ push @out, $_; } # no, add as new line
}
$, = "\n"; # set output field separator
$\ = "\n"; # set output record separator
print @out;
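One edge case to watch in the array-based version: if the very first non-blank line is indented, $out[-1] refers to an element of an empty array and Perl aborts with a "Modification of non-creatable array value attempted" error. Below is a minimal sketch of the same approach as a function with that guard added. The name merge_continuations is made up, only whole-line comments are skipped, and treating a leading continuation line as an ordinary line is my assumption about the desired behaviour.

```perl
use strict;
use warnings;

sub merge_continuations {
    my @out;
    for my $line (@_) {
        (my $text = $line) =~ s/\s+\z//;   # drop newline and trailing whitespace
        next if $text =~ /^\s*#/;          # skip comment lines
        next unless $text =~ /\S/;         # skip blank lines
        if (@out && $text =~ s/^\s+/ /) {  # indented, and there is a line to extend
            $out[-1] .= $text;
        }
        else {
            $text =~ s/^\s+//;             # leading continuation with nothing before it
            push @out, $text;
        }
    }
    return @out;
}

print "$_\n" for merge_continuations("a = 1\n", "  b\n", "# note\n", "\n", "c = 2\n", "\td\n");
```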