Perl automatically adds newline to output file - perl

I am using Perl to write to a file. It keeps adding a newline to the output file in the same spot even after I use chomp. I cannot figure out why.
Sample Code (reading from an input file, processing the line and then writing that line out to the output file):
open(OUT, "> out.txt");
# ...
while(<STDIN>) {
# ...
my $var = substr($_, index($_, "as "));
chomp($var);
print("Var is: " . $var); # no newline
print OUT $var . ","; # adds newline before the comma
# ...
}
# ...
close(OUT);
Any ideas as to what might be causing this or how to fix it? Thanks.

The canonical procedure:
while(<STDIN>) {
chomp;
# ...
my $var = substr($_, index($_, "as "));
print("Var is: " . $var); # no newline
print OUT $var . ","; # adds newline before the comma
# ...
}
In most operating systems, lines in files are terminated by newlines.
Just what is used as a newline may vary from OS to OS. Unix
traditionally uses \012 , one type of DOSish I/O uses \015\012 , Mac
OS uses \015 , and z/OS uses \025 .
Perl uses \n to represent the "logical" newline, where what is logical
may depend on the platform in use. In MacPerl, \n always means \015 .
On EBCDIC platforms, \n could be \025 or \045 . In DOSish perls, \n
usually means \012 , but when accessing a file in "text" mode, perl
uses the :crlf layer that translates it to (or from) \015\012 ,
depending on whether you're reading or writing. Unix does the same
thing on ttys in canonical mode. \015\012 is commonly referred to as
CRLF.
To trim trailing newlines from text lines use chomp(). With default
settings that function looks for a trailing \n character and thus
trims in a portable way.
In this case you're hitting a cross-platform barrier: you're reading a file written on one OS from a different platform whose line endings are not compatible. chomp removes only the trailing \n, so the \r of a CRLF pair is left behind in the data.
For a one-off run, you should convert your file's line endings to match the host.
To address the issue permanently, you can try https://metacpan.org/pod/File::Edit::Portable . Thanks @stevieb
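If converting the file is not practical, one possible workaround is to strip either kind of line ending explicitly instead of relying on chomp. This is only a minimal sketch, assuming each line ends in LF, CRLF, or a lone CR; the helper name strip_eol and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing CRLF, LF, or CR,
# regardless of which platform the script runs on.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z|\r\z//;
    return $line;
}

# A line as it would arrive on Unix from a Windows-style file.
my $line = "select x as foo\r\n";
my $var  = strip_eol(substr($line, index($line, "as ")));
print $var, ",\n";    # prints "as foo,"
```

Doing the stripping on the whole line first (right after reading it) would work just as well; the point is only that the \r has to be removed along with the \n.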

Related

Print Line By Line

I've been trying to work on a lyrical bot for my server, but before I started to work on it, I wanted to give it a test so I came up with this script using the Lyrics::Fetcher module.
use strict;
use warnings;
use Lyrics::Fetcher;
my ($artist, $song) = ('Coldplay', 'Adventures Of A Lifetime');
my $lyrics = Lyrics::Fetcher->fetch($artist, $song, [qw(LyricWiki AstraWeb)]);
my @lines = split("\n\r", $lyrics);
foreach my $line (@lines) {
sleep(10);
print $line;
}
This script works fine: it grabs the lyrics and prints them out all at once (which is not what I'm looking for).
I was hoping to achieve a line by line print of the lyrics every 10 seconds. Help please?
Your call to split looks suspicious. In particular the regex "\n\r". Note, the first argument to split is always interpreted as a regex regardless of whether you supply a quoted string.
On Unix systems the line ending is typically "\n". On DOS/Windows it's "\r\n" (the reverse of what you have). On ancient Macs it was "\r". To match all three you could do:
my @lines = split(/\r\n|\n|\r/, $lyrics);
You will need to enable autoflush, otherwise the lines will just be buffered and printed when the buffer is full or when the program terminates.
STDOUT->autoflush;
You can use the regex generic newline pattern \R to split on any line ending, whether your data contains CR, LF, or CR LF. This feature is available only in Perl v5.10 or better.
my @lines = split /\R/, $lyrics;
And you will need to print a newline after each line of lyrics, because the split will have removed them
print $line, "\n";
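Putting the pieces from these answers together, a minimal sketch might look like the following. The $lyrics string is placeholder text standing in for whatever Lyrics::Fetcher returns, and $delay is a hypothetical variable that is set to 0 here so the sketch runs instantly; set it to 10 for the behaviour asked about.

```perl
use strict;
use warnings;
use IO::Handle;    # makes the autoflush method available on older perls

# Placeholder text standing in for the fetched lyrics (mixed line endings on purpose).
my $lyrics = "first line\r\nsecond line\nthird line\r";
my $delay  = 0;    # seconds to wait between lines

STDOUT->autoflush(1);    # send each line to the terminal immediately

my @lines = split /\R/, $lyrics;    # \R needs Perl 5.10 or later
foreach my $line (@lines) {
    sleep $delay if $delay;
    print $line, "\n";    # split removed the line endings, so add one back
}
```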

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated with a plain old double space (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
my $inputFile = $_[0]; #read first argument from the commandline as fileName
open INPUTFILE, "+<", $inputFile or die $!; #Open File
my $singleLine;
my @singleRecord;
my $recordCounter = 0;
while (<INPUTFILE>) { # loop through the input file line-by-line
$singleLine = $_;
push(@singleRecord, $singleLine); # start adding each line to a record array
if ($singleLine =~ m/\r\n/) { # check for carriage return and new line
$recordCounter += 1;
createHashTable(@singleRecord); # send record make a hash table
@singleRecord = (); # empty the current record to start a new record
}
}
print "total records : $recordCounter \n";
close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
my ($input_file) = @_;
open my $fh, '<:crlf', $input_file or die $!;
local $/ = '';
my $record_counter = 0;
while (my $record = <$fh>) {
chomp $record;
++$record_counter;
create_hash_table(split /\n/, $record);
}
close $fh;
print "Total records : $record_counter\n";
}
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
If your files are not huge multi-gigabytes files, the easiest and safest way is to read the whole file, and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.
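The same idea can be written as a small script rather than a one-liner. This is only a sketch under stated assumptions: the text has already been slurped into a string, records are separated by one or more blank lines, and the helper name split_records is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split slurped text into records separated by
# one or more blank lines, whatever line-ending convention the file uses.
sub split_records {
    my ($text) = @_;
    $text =~ s/\A\R+//;    # ignore blank lines at the very start
    $text =~ s/\R+\z//;    # drop line endings at the very end
    return split /\R{2,}/, $text;
}

my $text    = "record 1\r\nsome text\r\n\r\nrecord 2\r\nsome text\r\n";
my @records = split_records($text);
print scalar(@records), "\n";    # prints 2
```

Using \R{2,} rather than \R\R treats several consecutive blank lines as a single separator, which is closer to what setting $/ to the empty string does.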

Perl non-English Character

See this piece of perl code:
#!/usr/bin/perl -w -CS
use feature 'unicode_strings';
open IN, "<", "wiki.txt";
open OUT, ">", "wikicorpus.txt";
binmode( IN, ':utf8' );
binmode( OUT, ':utf8' );
## Condition plain text English sentences or word lists into a form suitable for constructing a vocabulary and language model
while (<IN>) {
# Remove starting and trailing tags (e.g. <s>)
# s/\<[a-z\/]+\>//g;
# Remove ellipses
s/\.\.\./ /g;
# Remove unicode 2500 (hex E2 94 80) used as something like an m-dash between words
# Unicode 2026 (horizontal ellipsis)
# Unicode 2013 and 2014 (m- and n-dash)
s/[\x{2500}\x{2026}\x{2013}\x{2014}]/ /g;
# Remove dashes surrounded by spaces (e.g. phrase - phrase)
s/\s-+\s/ /g;
# Remove dashes between words with no spaces (e.g. word--word)
s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
# Remove dash at a word end (e.g. three- to five-year)
s/(\w)-\s/$1 /g;
# Remove some punctuation
s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g;
# Remove quotes
s/[\p{Initial_Punctuation}\p{Final_Punctuation}]/ /g;
# Remove trailing space
s/ $//;
# Remove double single-quotes
s/'' / /g;
s/ ''/ /g;
# Replace accented e with normal e for consistency with the CMU pronunciation dictionary
s/?/e/g;
# Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
s/\s'([\w\s]+[\w])'\s/ $1 /g;
# Remove double spaces
s/\s+/ /g;
# Remove leading space
s/^\s+//;
chomp($_);
print OUT uc($_) . "\n";
# print uc($_) . " ";
} print OUT "\n";
It seems that there is a non-English character on line 49, namely the line s/?/e/g;.
When I run this, Perl warns: Quantifier follows nothing in regex.
How can I deal with this problem? How can I make Perl recognize the character? I have to run this code with Perl 5.10.
Another small question: what does the "-CS" in the first line mean?
Thanks to all.
I think your problem is that your editor doesn't handle unicode characters, so the program is trashed before it even gets to perl, and as this apparently isn't your program, it was probably trashed before it got to you.
Until the entire tool chain handles unicode correctly, you have to be careful to encode non-ascii characters in a way that preserves them. It's a pain, and no simple solutions exist. Consult your perl manual for how to embed unicode characters safely.
As per the comment line just before the erroneous line, the character to be replaced is an accented "e"; presumably what is meant is e with an acute accent: "é". Assuming your input is Unicode, it can be represented in Perl as \x{00E9}. See also http://www.fileformat.info/info/unicode/char/e9/index.htm
I guess you copy/pasted this script from a web page on a server which was not properly configured to display the required character encoding. See further also http://en.wikipedia.org/wiki/Mojibake
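As a minimal sketch of that suggestion, the substitution can be written with the escape so that the source file itself stays pure ASCII. The helper name fold_e_acute is hypothetical, and the sketch assumes the string has already been decoded to characters (for example by a :utf8 layer on the input handle).

```perl
use strict;
use warnings;

# Hypothetical helper: replace e-acute (U+00E9) with a plain "e".
# Assumes $text holds decoded characters, not raw UTF-8 bytes.
sub fold_e_acute {
    my ($text) = @_;
    $text =~ s/\x{00E9}/e/g;
    return $text;
}

my $word = "caf\x{00E9}";
print fold_e_acute($word), "\n";    # prints "cafe"
```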

Perl New Line separator issue

I have a file that uses CR/LF to separate records, but individual records sometimes contain a LF.
while (<$in>)
{
#extract record data
}
I am trying to read this file as above, and this (as I would expect) splits the records that contain a lone LF. I would have expected that reassigning $/ would resolve the issue, but it appears to cause the complete file to be read in one iteration.
$/ = "\r\n";
while (<$in>)
{
#extract record data
}
Anyone here who can suggest a working solution?
I am using Activestate Perl on Windows.
On windows, perl converts the incoming CRLF line endings to LF only, making a distinction between CRLF and LF impossible by reading in the data as text (perlport). Therefore, you have to read your data in binary mode using binmode on your file-handle:
binmode($in);
After that, you can set the input record separator to "\015\012" and read-in your records as usual:
$/ = "\015\012";
while (<$in>) {
...
}
greets, Matthias
PS: I have no chance to test that locally, at the moment, so I regret if it does not work.
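For anyone who wants to try the approach without a real file, here is a small sketch that uses an in-memory filehandle in place of $in. The sample data is invented: two CRLF-terminated records, the first containing an embedded LF.

```perl
use strict;
use warnings;

# Invented sample data: two CRLF-terminated records; the first has an embedded LF.
my $data = "line one\nstill record one\r\nrecord two\r\n";

# An in-memory filehandle stands in for the real file here.
open my $in, '<', \$data or die "Cannot open in-memory handle: $!";
binmode $in;    # no CRLF translation, so \015\012 stays visible in the data

my @records;
{
    local $/ = "\015\012";
    while (my $record = <$in>) {
        chomp $record;    # removes the trailing \015\012
        push @records, $record;
    }
}
close $in;

print scalar(@records), " records\n";    # prints "2 records"
```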
Try setting $/ to "\n". From Newlines in perlport:
Perl uses \n to represent the "logical" newline, where what is logical
may depend on the platform in use. In MacPerl, \n always means \015.
In DOSish perls, \n usually means \012, but when accessing a file in
"text" mode, perl uses the :crlf layer that translates it to (or from)
\015\012, depending on whether you're reading or writing.
Try this before the while loop:
binmode($in);

is there a way to designate the line token delimiter in Perl's file reader?

I'm reading a text file via CGI in Perl, and noticing that when the file is saved in Mac's TextEdit the line separators are recognized, but when I upload a CSV that is exported straight from Excel, they are not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be, if I didn't want the one it's looking for by default.
Yes. You'll want to overwrite the value of $/. From perlvar
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
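To make the fixed-size record behaviour described in that excerpt concrete, here is a small sketch that reads a string through an in-memory filehandle four characters at a time. The data and the record size of 4 are arbitrary choices for illustration.

```perl
use strict;
use warnings;

# Arbitrary sample data, read through an in-memory filehandle.
my $data = "abcdefghij";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @chunks;
{
    local $/ = \4;    # read records of at most 4 characters
    while (my $chunk = <$fh>) {
        push @chunks, $chunk;
    }
}
close $fh;

print join("|", @chunks), "\n";    # prints "abcd|efgh|ij"
```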
The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
Remember to localize carefully:
{
local($/) = "\r\n";
...code to read...
}
If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabytes file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
my @lines;
{
local $/ = undef; # slurp mode: read the whole file in one go
open my $fh, '<', $YOUR_FILE or die "Cannot open $YOUR_FILE: $!";
@lines = split /\R/, <$fh>;
close $fh;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
From the perldoc :
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and exhaustive explanation about \R in Brian D Foy's article : The \R generic line ending which even has a couple of fun videos.