Parsing an XML file line by line

Parsing an XML file line by line - perl

So here is the issue. I am trying to parse a XML file of information from GenBank. This file contains information on multiple DNA sequences. I have this done already for two other xml formats from genbacnk (TINY xml and INSD xml), but pure xml gives me a headache. Here's how my program should work. Download an xml formated file that contains information on X number of sequences from GenBank. Run my perl script that searches through that xml file line by line and prints the information I want to a new file, in fasta format. Which is this: >Sequence_name_and_information\n sequences\n >sequence_name.... and on and on until you have all the sequences from the xml file. My issue though is that in pure xml the sequence itself comes before the identifier for the gene or locus of the sequences. The gene or locus of the sequences should go in the same line as the ">". Here is the code I have from the point of opening the file and parsing through it:
open( New_File, "+>$PWD_file/$new_file" ) or die "\n\nCouldn't create file. Check permissions on location.\n\n";
while ( my $lines = <INSD> ) {
foreach ($lines) {
if (m/<INSDSeq_locus>.*<\/INSDSeq_locus>/) {
$lines =~ s/<INSDSeq_locus>//g and $lines =~ s/<\/INSDSeq_locus>//g and $lines =~ s/[a-z, |]//g; #this last bit may cause a bug of removing the letters in the genbank accession number
$lines =~ s/ //g;
chomp($lines);
print New_File ">$lines\_";
} elsif (m/<INSDSeq_organism>.*<\/INSDSeq_organism>/) {
$lines =~ s/<INSDSeq_organism>//g and $lines =~ s/<\/INSDSeq_organism>//g;
$lines =~ s/(\.|\?|\-| )/_/g;
$lines =~ s/_{2,}/_/g;
$lines =~ s/_{1,}$//;
$lines =~ s/^>*_{1,}//;
$lines =~ s/\s{2}//g;
chomp($lines);
print New_File "$lines\n";
} elsif (m/<INSDSeq_sequence>.*<\/INSDSeq_sequence>/) {
$lines =~ s/<INSDSeq_sequence>//g and $lines =~ s/<\/INSDSeq_sequence>//g;
$lines =~ s/ //g;
chomp($lines);
print New_File "$lines\n";
}
}
}
close INSD;
close New_File;
}
There are two places to find Gene/locus information. That info is found between either on of these two tags: LOCUS_NAME or GENE_NAME. There will be one, or the other. If one has info the other will be empty. In either case both need to be added to the end of the >....... line.
Thanks,
AlphaA
PS--I tried to print that info to a "file" by doing open "$NA", ">" the sequence to that, then moving on with the program, finding the gene info, printing it to the > line and then read the $NA file and printing it to the line right after the > line. I hope this is clear.

In my opinion you should use XSLT with XPath to navigate to the data you need.
As #Brian suggests, it is easier to use established XML parsing techniques and libraries.
There is even a Perl library for XSLT

Use an XML parser. I'm not a biologist, and I'm not sure of the final format you want, but it should be simple with this as a starting point. $_[1] in the anonymous sub contains a hash reference with, from what I can tell above, everything that I think you want saved from parsing the parent tag of the tags you want. It should be easy to print out the elements of $_[1] in the format that you want it to be in:
use strict;
use warnings;
use XML::Rules;
use Data::Dumper;
my #rules = (
_default => '',
'INSDSeq_locus,INSDSeq_organism,INSDSeq_sequence' => 'content',
INSDSeq => sub { delete $_[1]{_content}; print Dumper $_[1]; return },
);
my $p = XML::Rules->new(rules => \#rules);
$p->parsefile('sequence.gbc.xml');
And that is just so that printing just the tags you want is easy. Or, if you want some other tags, What I really might do is this (you don't really need the #tags variable at all if you're just printing element by element):
my #tags = qw(
INSDSeq_locus
INSDSeq_organism
INSDSeq_sequence
);
my #rules = (
_default => 'content',
# Elements are, e.g. $_[1]{INSDSeq_locus}
INSDSeq => sub { print "$_: $_[1]{$_}\n" for #tags; return; },
);
with:
my $p = XML::Rules->new(rules => \#rules, stripspaces => 4);

Related

Find a line by using filereader in perl

I created a text-box .I want to control what the user writes in the text-box by reading a text file and comparing each line with the text using Perl. In my code I filled to param('text')
open(DATA, "<baba.txt") or die "Couldn't open file file.txt, $!";
while(<DATA>)
{
if($_=~param('text'))
{
print $_;
}
}
I have no problem while reading the file but i couldn't handle the matches. It returned nothing.
What is wrong with my code?

The right side of the =~ operator needs to be a regular expression.
See this site for more details.

while (<DATA>)
{
chomp;
if (param('text') =~ /\Q$_/)
{
print $_;
}
}

Perhaps the following will be helpful:
use strict;
use warnings;
my $text = <<END;
This is just a BU.NCH of text
in a here document that will
be used for some matching in
just a little bit.
END
while (<DATA>) {
chomp;
if ( $text =~ /\b\Q$_\E\b/i ) {
print $_, "\n";
}
}
__DATA__
this
some
bu.nch
a
hello
world
Output:
this
some
bu.nch
a
Before attempting to match a word read from a file, you need to chomp it, to remove the record separator (if any), which is usually \n. There are also a few other items for you to consider:
Whether you want a case-insensitive match
Escaping any meta-characters in your words which may be present
Forcing word borders to prevent an in-string match.
Item (1) above is achieved by using the /i modifier. Item (2) is done by enclosing the 'word' in the regex like this: \Q$_\E (\Quote-meta; \End Quote-meta). And the last uses \b: \b\Q$_\E\b.
Hope this helps!

try this
my $text = param('text');
open(DATA, "<baba.txt") or die "Couldn't open file file.txt, $!";
while(<DATA>)
{
if($_=~ /$text/ )
{
print $_;
}
}
because this -> $_=~param('text') not regexp search

In a file/array, search for hash key, and replace it with the hash value, do this for all hash keys/values

I've searched around the site and surprisingly I can't seem to find something that will work for my particular problem. So I figured I'd post it and see how some of you more experienced programmers can address with problem.
I have a spreadsheet like text file (many lines with tab delimited columns), that I would like to search through for certain labels (ex scaffold1253.1_size81005.6.32799_7496) and replace them with more simplified labels (ex scaffold1253.1a). These labels are only in the first column of the text file. I've already written the script such that I have a hash with the old labels as keys corresponding to the new labels as their respective values. This hash has about 26000 lines. So essentially I'd like to take the hash keys 1 by 1, search for them in the text file, and replace them with their respective hash values.
I have a pretty good server availible so if its too complicated to make it first column specific to speed up the process then thats ok.
THis is what I have so far:
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
#gtfarray = <FASTAFILE2>;
#print #gtfarray;
my %hash;
while (<>)
{
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
#print %hash;
while (my ($find, $replace) = each %hash) {
foreach (#gtfarray){
$_ =~ s/$find/$replace/g;
push #newgtf, $_;
}
}
print #newgtf;
This code doesn't seem to work as it doesn't complete. I'm pretty sure it's a problem with the foreach loop structure. Sorry I don't know of any other way to do this. Does anyone have a better way to run through this file and conduct the replacement?
Any input would be greatly appreciated!
Thanks,
Andrew
#DVK
Here is the full script with your mods that runs into syntax errors with your while loop, any idea why it's not accepting it? Thanks again!
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
my %hash;
while (<>){
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
while $line (<FASTAFILE2>){
my #fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (#fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print training \t
}
print $outfile "\n"
}
__END__
Here is the syntax error:
perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.

You exhaust your file the first time through your loop using the initial $find and $replace key/value pair.
There are two potential solutions:
Open the file for reading during each iteration of your while loop (expensive)
Move the foreach loop to the outside of the while and iterate the hash each time (less expensive)
example:
REPLACE:
for my $line (#gtfarray) {
while(my ($find, $replace) = each %hash) {
if($line =~ s/$find/$replace/g) {
push #newgtf, $line;
next REPLACE; # skip to next iteration
}
}
# if there was no replacement, push the old line
push #newgtf, $line
}

How big is the file that you are replacing the first column in?
If it's >50,000 lines, you are better off doing the reverse:
Iterate through hash file once, and store that hash in memory
Iterate through main file once, and for every line, for every column, find that value in the memorized hash, replace with hash value if found, and write.
In other words, remove the first #gtfarray = <FASTAFILE2>; and replace your last while loop with:
while my $line (<FASTAFILE2>) {
my #fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (#fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print training \t
}
print $outfile "\n";
}
NOTE: I'm making an assumption that the fields contain FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").
If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution aside from running O(N*M) regexes: if your scaffold strings are all of a certain well defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:
For each line of data file, run a single regex finding that pattern, with the entire pattern inside a capture group parenthesis:
#matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+/g );
Then, look up every value of #matches array in the hash. If found, run ONLY the matches as a s/// regex.

Looking at your previous post, wouldn't it be more simple to create the shortened 'id' while reading the file. Then you would have no need of the other file where you get your hash?
Here is the (untested) code below. (would need to direct the print statements to an output file on the command line or open a file for writing in your script).
#!/usr/bin/perl
use strict;
use warnings;
my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";
my %seen;
while (<$FASTAFILE2>) {
chomp;
my ($id, $val) = split /\t/, $_, 2;
# copy $id to $prefix and
# remove everything after '.1' in $prefix
(my $prefix = $id) =~ s/\.1\K.*//;
if ($seen{$id}) {
++$seen{$id};
}
else {
$seen{$id} = 'a';
}
print "$prefix$seen{$id}\t$val\n";
}
close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";

Could it be a job for Tie::File? Assuming, that is, the data file could be operated on as an array.
use Tie::File;
my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie #lines, 'Tie::File', $file or die ;
for (#lines) {
s/Oldlabel/NewLable/g; # Change this to fit
}
untie #lines ;
Tie::File does a bunch of tricks to keep the "in place " changes to the file memory efficient.

Selectively joining elements of an array into fewer elements of a new array

I'm having some trouble manipulating an array of DNA sequence data that is in .fasta format. What I would specifically like to do is take a file that has a few thousand sequences and adjoin sequence data for each sequence in the file onto a single line in the file. [Fasta format is as such: A sequence ID starts with > after which everything on that line is a description. On the next line(s) the sequence corresponding to this ID is present. And this can continue indefinitely until the next line that begins with >, which is the id of the next sequence in the file] So, in my particular file most of my sequences are on multiple lines, so what I would like to do is essentially remove the newlines, but only the new lines between sequence data, not between sequence data and sequence ID lines (that start with >).
I'm doing this because I want to be able to attain sequence lengths of each sequence (through length, I believe is the easiest way), and then get an average sequence length of all the sequences in the whole file.
Here's my script so far, that doesnt seem to want to work:
#!/usr/bin/perl -w
##Subroutine
sub get_file_data1 {
my($filename) = $_[0];
my #filedata = ();
unless( open(GET_FILE_DATA, $filename)) {
print STDERR "Cannot open file \"$filename\"\n\n";
exit;
}
#filedata = <GET_FILE_DATA>;
close GET_FILE_DATA;
return #filedata;
}
##Opening files
my $fsafile = $ARGV[0];
my #filedata = &get_file_data1($fsafile);
##Procedure
my #count;
my #ids;
my $seq;
foreach $seq (#filedata){
if ($seq =~ /^>/) {push #ids, $seq;
push #count, "\n";
}
else {push #count, $seq;
}
}
foreach my $line (#count) {
if ($line =~ /^[AGTCagtc]/){
$line =~ s/^([AGTCagtc]*)\n/$1/;
}
}
##Make a text file to have a look
open FILE3, "> unbrokenseq.txt" or die "Cannot open output.txt: $!";
foreach (#count)
{
print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
__END__
##Creating array of lengths
my $number;
my #numberarray;
foreach $number (#count) {
push #numberarray, length($number);
}
print #numberarray;
__END__
use List::Util qw(sum);
sub mean {
return sum(#numberarray)/#numberarray;
}
There's something wrong with the second foreach line of the Procedure section and I can't seem to figure out what it is. Note that the code after the END lines I haven't even tried yet because I cant seem to get the code in the procedure step to do what I want. Any idea how I can get a nice array with elements of unbroken sequence (I've chosen to just remove the sequence ID lines from the new array..)? When I can then get an array of lengths, after which I can then average?
Finally I should unfortunately admit that I cannot get Bio::Perl working on my computer, I have tried for hours but the errors are beyond my skill to fix. Ill be talking to someone who can hopefully help me with my Bio::perl issues. But for now I'm just going to have to press on without it.
Thanks! Sorry for the length of this post, I appreciate the help.
Andrew

The problem with your second loop is that you are not actually changing anything in #count because $line contains a copy of the values in #count.
But, if all you want to do in the second loop is to remove the newline character at the end, use the chomp function. with this you wouldn't need your second loop. (And it would also be faster than using the regex.)
# remove newlines for all array elements before doing anything else with it
chomp #filedata;
# .. or you can do it in your first loop
foreach $seq (#filedata){
chomp $seq;
if ($seq =~ /^>/) {
...
}
An additional tip: Using get_file_data1 to read the entire file into an array might be slow if your files are large. In that case it would be better to iterate through the file as you go:
open my $FILE_DATA, $filename or die "Cannot open file \"$filename\"\n";
while (my $line = <$FILE_DATA>) {
chomp $line;
# process the record as in your Procedure section
...
}
close $FILE_DATA;

Your regex captures specifically to $1 but you are printing $_ to the file. The result being most likely not what you intended.

Be careful with the '*' or 'greedy' modifier to your character groups in s///. You usually want the '+' instead. '*' will also match lines containing none of your characters.
A Search expression with a 'g' modifier can also count characters. Like this:
$perl -e '$a="aggaacaat"; $b = $a =~ s/[a]//g; print $b; '
5
Pretty cool huh! Alternately, in your code, you could just call length() against $1.
I was taken aback to see the escaped '/n' in your regex. While it works fine, the common 'end-of-line' search term is '$'. This is more portable and doesn't mess up your character counts.

parse a huge text file in perl

I have a text file which is tab separated. They can be quite big upto 1 GB. I will have variable number of columns depending on the number of sample in them. Each sample have eight columns.For example, sampleA : ID1, id2, MIN_A, AVG_A, MAX_A,AR1_A,AR2_A,AR_A,AR_5. Of which the ID1, and id2 are the common to all the samples. What I want to achieve is split the whole file in to chunks of files depending on the number of samples.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks, I want to have them as :
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487.
Is there any easy way of doing this than going thorough an array?
How I have worked out my logic is counting the (number of headers - 2) and dividing them by 8 will give me the number of Samples in the file. And then going through each element in an array and to parse them . Seems to be a tedious way of doing this. I would be happy to know any simpler way of handling this.
Thanks
Sipra

#!/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
chomp;
my #data = split /,/;
print $fh{A} join(',', #data[0 .. 9]), "\n";
print $fh{B} join(',', #data[0, 1, 10 .. 17]), "\n";
print $fh{C} join(',', #data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it or add comments.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my #cols = split/,/, $head;
die 'Invalid number of records - ' . #cols . "\n"
if (#cols -2) % 8;
my #files;
my $name = 'A';
foreach (1 .. (#cols - 2) / 8) {
my %desc;
$desc{start_col} = (($_ - 1) * 8) + 2;
$desc{end_col} = $desc{start_col} + 7;
open $desc{fh}, '>', 'file' . $name++ or die $!;
print {$desc{fh}} join(',', #cols[0,1],
#cols[$desc{start_col} .. $desc{end_col}]),
"\n";
push #files, \%desc;
}
while (<$in>) {
chomp;
my #data = split /,/;
foreach my $f (#files) {
print {$f->{fh}} join(',', #data[0,1],
#data[$f->{start_col} .. $f->{end_col}]),
"\n";
}
}

This is independent to the number of samples. I'm not confident on the output file name though because you might reach more than 26 samples. Just replace how the output file name works if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
#I'm lazy
my #source_file = read_file('source_file.csv');
# you metion yours is tab separated
# just add the {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $output_file;
#read each row
while ( my $raw_line = shift #source_file ) {
$csv->parse($raw_line);
my #fields = $csv->fields();
#get the first 2 ids
my #ids = splice #fields, 0, 2;
my $group = 0;
while (#fields) {
#get the first 8 columns
my #columns = splice #fields, 0, 8;
#if you want to change the separator of the output replace ',' with "\t"
push #{ $output_file->[$group] }, (join ',', #ids, #columns), $/;
$group++;
}
}
#for filename purposes
my $letter = 65;
foreach my $data (#$output_file) {
my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
write_file( $output_filename, #$data );
$letter++;
}
#if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (#$output_file) {
# my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
# write_file( $output_filename, #$data );
# $sample_number++;
#}

Here is a one liner to print the first sample, you can write a shell script to write the data for different samples into different files
perl -F, -lane 'print "#F[0..1] #F[2..9]"' <INPUT_FILE_NAME>

You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read in each line, parse that line, then write that line to the various output files, you're not taking up a lot of memory. There's a cache, but I believe it's emptied after a \n is written to the file.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
while (my $line = <$input_fh>) { #Line from the input file.
chomp $line;
my #input_line_array = split /\t/, $line;
my $fileHandle;
foreach my $output_file (#outputFileList) { #List of output files.
$fileHandle = $output_file->{FILE_HANDLE};
my #fieldsToWrite;
foreach my $fieldNumber (#{$output_file->{FIELD_LIST}}) {
push $fieldsToWrite, $input_line_array[$field];
}
say $file_handle join "\t", #fieldsToWrite;
}
}
I'm reading in one line of the input file into $line and dividing that up into fields which I am putting in the #input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called #outputFileList that is a list of all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for my output file $fileNumber. $ouputFileList[$fileNumber]->{FIELD_LIST} is a list of fields I want to write to output file $fileNumber. This is indexed to the fields in #input_line_array. So if
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
Means that I want to write the following fields to my output file: $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8] to my output file $outputFileList->[$fileNumber]->{FILE_HANDLE} in that order as a tab separated list.
I hope this is making some sense.
The initial problem is reading in the first line of <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea on how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
Although I didn't use object oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post). I would definitely use an object oriented code approach with this. It will actually make things much faster by removing errors.

Neatest way to remove linebreaks in Perl

I'm maintaining a script that can get its input from various sources, and works on it per line. Depending on the actual source used, linebreaks might be Unix-style, Windows-style or even, for some aggregated input, mixed(!).
When reading from a file it goes something like this:
#lines = <IN>;
process(\#lines);
...
sub process {
#lines = shift;
foreach my $line (#{$lines}) {
chomp $line;
#Handle line by line
}
}
So, what I need to do is replace the chomp with something that removes either Unix-style or Windows-style linebreaks.
I'm coming up with way too many ways of solving this, one of the usual drawbacks of Perl :)
What's your opinion on the neatest way to chomp off generic linebreaks? What would be the most efficient?
Edit: A small clarification - the method 'process' gets a list of lines from somewhere, not nessecarily read from a file. Each line might have
No trailing linebreaks
Unix-style linebreaks
Windows-style linebreaks
Just Carriage-Return (when original data has Windows-style linebreaks and is read with $/ = '\n')
An aggregated set where lines have different styles

After digging a bit through the perlre docs a bit, I'll present my best suggestion so far that seems to work pretty good. Perl 5.10 added the \R character class as a generalized linebreak:
$line =~ s/\R//g;
It's the same as:
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.

Whenever I go through input and want to remove or replace characters I run it through little subroutines like this one.
sub clean {
my $text = shift;
$text =~ s/\n//g;
$text =~ s/\r//g;
return $text;
}
It may not be fancy but this method has been working flawless for me for years.

$line =~ s/[\r\n]+//g;

Reading perlport I'd suggest something like
$line =~ s/\015?\012?$//;
to be safe for whatever platform you're on and whatever linefeed style you may be processing because what's in \r and \n may differ through different Perl flavours.

Note from 2017: File::Slurp is not recommended due to design mistakes and unmaintained errors. Use File::Slurper or Path::Tiny instead.
extending on your answer
use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;
File::Slurp abstracts away the File IO stuff and just returns a string for you.
NOTE
Important to note the addition of /g , without it, given a multi-line string, it will only replace the first offending character.
Also, the removal of $, which is redundant for this purpose, as we want to strip all line breaks, not just line-breaks before whatever is meant by $ on this OS.
In a multi-line string, $ matches the end of the string and that would be problematic ).
Point 3 means that point 2 is made with the assumption that you'd also want to use /m otherwise '$' would be basically meaningless for anything practical in a string with >1 lines, or, doing single line processing, an OS which actually understands $ and manages to find the \R* that proceed the $
Examples
while( my $line = <$foo> ){
$line =~ $regex;
}
Given the above notation, an OS which does not understand whatever your files '\n' or '\r' delimiters, in the default scenario with the OS's default delimiter set for $/ will result in reading your whole file as one contiguous string ( unless your string has the $OS's delimiters in it, where it will delimit by that )
So in this case all of these regex are useless:
/\R*$// : Will only erase the last sequence of \R in the file
/\R*// : Will only erase the first sequence of \R in the file
/\012?\015?// : When will only erase the first 012\015 , \012 , or \015 sequence, \015\012 will result in either \012 or \015 being emitted.
/\R*$// : If there happens to be no byte sequences of '\015$OSDELIMITER' in the file, then then NO linebreaks will be removed except for the OS's own ones.
It would appear nobody gets what I'm talking about, so here is example code, that is tested to NOT remove line feeds. Run it, you'll see that it leaves the linefeeds in.
#!/usr/bin/perl
use strict;
use warnings;
my $fn = 'TestFile.txt';
my $LF = "\012";
my $CR = "\015";
my $UnixNL = $LF;
my $DOSNL = $CR . $LF;
my $MacNL = $CR;
sub generate {
my $filename = shift;
my $lineDelimiter = shift;
open my $fh, '>', $filename;
for ( 0 .. 10 )
{
print $fh "{0}";
print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
print $fh "{1}";
print $fh $lineDelimiter->();
print $fh "{2}";
}
close $fh;
}
sub parse {
my $filename = shift;
my $osDelimiter = shift;
my $message = shift;
print "Parsing $message File $filename : \n";
local $/ = $osDelimiter;
open my $fh, '<', $filename;
while ( my $line = <$fh> )
{
$line =~ s/\R*$//;
print ">|" . $line . "|<";
}
print "Done.\n\n";
}
my #all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL };
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
return #all[ int(rand(2)) ];
};
for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
for ( qw( Windows Mac Unix Mixed ) ){
parse $_ . ".txt", #{ $os };
}
}
For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74
Note there are certain combinations that of course work, but they are likely the ones you yourself naívely tested.
Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.
and $string is of the general form {0}$data{1}$delimiter{2} where in all output sources, there should be either :
Nothing between {1} and {2}
only |<>| between {1} and {2}

In your example, you can just go:
chomp(#lines);
Or:
$_=join("", #lines);
s/[\r\n]+//g;
Or:
#lines = split /[\r\n]+/, join("", #lines);
Using these directly on a file:
perl -e '$_=join("",<>); s/[\r\n]+//g; print' <a.txt |less
perl -e 'chomp(#a=<>);print #a' <a.txt |less

To extend Ted Cambron's answer above and something that hasn't been addressed here: If you remove all line breaks indiscriminately from a chunk of entered text, you will end up with paragraphs running into each other without spaces when you output that text later. This is what I use:
sub cleanLines{
my $text = shift;
$text =~ s/\r/ /; #replace \r with space
$text =~ s/\n/ /; #replace \n with space
$text =~ s/ / /g; #replace double-spaces with single space
return $text;
}
The last substitution uses the g 'greedy' modifier so it continues to find double-spaces until it replaces them all. (Effectively substituting anything more that single space)