Remove CRLF end of csv file using Perl - perl

I have a CSV file in which every record ends with CRLF. Using Perl, how can I remove the CRLF from only the final record, so that there is no empty row at the end of the file? Thank you.

If I follow the question correctly, there is a line feed trailing the last record, which is creating an empty row at the end of the file.
You can read the file into a scalar and remove the trailing line break with a substitution. \R is available from Perl 5.10 onward and matches any platform's line break (\n, \r\n, or \r); on older versions you'll need to spell out \n or \r\n yourself.
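open my $fh, '<', 'test.csv' or die "Cannot open test.csv: $!";
my $str = do { local $/; <$fh> }; # slurp the whole file into one scalar
close $fh;
$str =~ s/\R+\z//; # strip only the line break(s) at the very end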
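A minimal sketch of the same idea as a reusable helper, assuming the file contents are already in a string. The name strip_final_eol is hypothetical and not part of any module; anchoring with \z is what keeps the line breaks between records intact.

```perl
use strict;
use warnings;
use v5.10;    # \R needs Perl 5.10 or later

# Hypothetical helper: remove every line break at the very end of the
# string, leaving the breaks between records untouched.
sub strip_final_eol {
    my ($text) = @_;
    $text =~ s/\R+\z//;
    return $text;
}

my $csv   = "a,b\r\nc,d\r\n";
my $clean = strip_final_eol($csv);
print length($clean), "\n";    # 8: the final CRLF is gone, the inner one stays
```

Writing $clean back to the file (in binary mode, so the remaining CRLFs are not translated) would complete the round trip.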

Related

How to remove newline from the end of a file using Perl

I have a file that reads like this:
dog cat mouse
apple orange pear
red yellow green
There is a tab \t separating the words on each row, and a newline \n separating each of the rows. Below the last line, red yellow green there is a blank line due to a newline \n after green.
I would like to use Perl to remove the newline.
I have seen a few articles, such as "How can I delete a newline if it is the last character in a file?", that give solutions for Perl, but I would like to do this in code that I can incorporate into my Perl script.
I don't know if this might be possible using chomp, or if chomp works on each line separately (I would like to keep the newline between lines).
Also I have seen previously comments that suggest maintaining a newline at the end of a file because Unix commands work better when a file ends with a newline. However, I have created a script which relies on input files not ending with a newline, therefore I really feel removing the newlines is necessary for my work.
You can try this:
perl -pe 'chomp if eof' file.txt
Here is another simple way, if you need it in a script:
open my $fh, '<', 'file.txt' or die "Cannot open file.txt: $!";
my @lines = <$fh>; # read all lines and store in array
close $fh;
chomp $lines[-1] if @lines; # remove newline from last line
print @lines;
Or something like this (in script), as suggested by jnhc for the command line:
open my $fh, '<', 'file.txt' or die "Cannot open file.txt: $!";
while (<$fh>) {
    chomp if eof $fh; # true only once the last line has been read
    print;
}
close $fh;
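The loop above can be wrapped so that it writes to any output handle instead of stdout. This is only a sketch: the helper name copy_without_final_newline is made up for illustration, and in-memory filehandles stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from one handle to another, dropping
# the line ending of the final line only.
sub copy_without_final_newline {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line if eof $in;    # eof is true once the last line is read
        print {$out} $line;
    }
    return;
}

my $input = "dog\tcat\tmouse\nred\tyellow\tgreen\n";
open my $in,  '<', \$input     or die "Cannot read string: $!";
open my $out, '>', \my $result or die "Cannot write string: $!";
copy_without_final_newline($in, $out);
close $out;
close $in;
```

After the call, $result holds the same text without the newline after "green"; the newline between the two rows is kept.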

Removing bullet points from a txt file using perl

I am writing a Perl script to process a text file. I need to remove bullet points from the text file and create a new one without bullets. When I look at a hex dump of the text file, the bullet is stored as the UTF-8 bytes 0xE2 0x80 0xA2 (the Unicode bullet, U+2022). How do I remove the bullet from a string?
I have tried the following code:
open($filehandle, '<:encoding(UTF-8)', $filename)
    or die "Could not open file '$filename' $!";
while ($row = <$filehandle>)
{
    @txt_str = split(/\•/, $row);
    $row = join(" ",@txt_str);
}
The backslash doesn't help you here, as the bullet is not a special character in regexes.
If you specify the input is UTF-8, you should search for a UTF-8 bullet. To do so, either prepend
use utf8;
and save your script as UTF-8; or, use
\N{BULLET}
In your case, splitting and joining can be replaced by simple replacement of the bullet by a space:
while (<$filehandle>) {
s/\N{BULLET}/ /g; # or s/•/ /g under utf8
print; # <-- this was missing in your code
}
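As a self-contained sketch of the \N{BULLET} approach: the helper name strip_bullets is hypothetical, and Encode's decode is used here only to show that the three bytes from the hex dump become a single character once decoded. The charnames line matters only before Perl 5.16, where \N{...} does not load names automatically.

```perl
use strict;
use warnings;
use charnames ':full';    # required for \N{BULLET} before Perl 5.16
use Encode qw(decode);

# Hypothetical helper: replace each bullet (U+2022) with a space.
sub strip_bullets {
    my ($text) = @_;
    $text =~ s/\N{BULLET}/ /g;
    return $text;
}

# The three bytes seen in the hex dump decode to one character.
my $bytes = "\xE2\x80\xA2 first item\n";
my $text  = decode('UTF-8', $bytes);
print strip_bullets($text);
```

When reading through a handle opened with '<:encoding(UTF-8)', the decode step is already done by the layer and only the substitution is needed.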
Why not use a simple s/•/ /g instead of splitting and joining? You should also print the resulting variable ($row in your case) to another file or to stdout, otherwise you won't see the "unbulleted" version.
For this task, though, I'd use sed from the command line; I'm pretty sure it can handle Unicode characters too.

Perl 5.12.3 fails to loop CSV file line by line

I'm sure someone has an explanation as to what is happening with the following script:
Please note, the file I specify is available and is opening. I know this because the last line of the file is output when the program is run, but it is only the last line.
Note about the .csv file: it's generated on windows (I'm using OS X 10.7.4 with Perl 5.12.3) and uses \r line breaks. I attempted to tell perl that the line break character was \r at the top of the script but it does not work. I know they're \r as the grep search finds them in a text editor.
The script runs and only prints the last line of the file. If I plug in a regular expression it will grab the first matching field from the first line and echo it fine, but I cannot iterate over the entire file.
Any clarification is appreciated as I am new to perl.
#!/usr/bin/perl
use warnings;
print "Please enter your filename:";
my ($dataline);
open(INFO,'./expensereport.csv') || die("can't open datafile: $!");
while (my $line = <INFO>) {
chomp $line;
print $line;
}
print $!;
The carriage returns without linefeed are causing print to overwrite each line on the same line, so all you see is the last.
Run dos2unix on your input file before processing.
There are several ways to tell perl that your input file is windows-style :crlf.
perldoc -f binmode or perldoc -f open
open(INFO, '<:crlf', './expensereport.csv')
...
Ahh, that's clear! :)
Look, you have a file with \r (carriage return, literally) and \n (newline). chomp cuts off \n (new line). So you print over the same line (remember "carriage return") again and again.
Use print "$line\n"; instead
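If the file really does use a bare \r between lines, as the question describes, another option is to make \r the input record separator so that the loop sees one line per iteration. A rough sketch, with a hypothetical helper name and an in-memory handle standing in for the CSV file:

```perl
use strict;
use warnings;

# Hypothetical helper: read every record from a handle whose lines end
# in a bare carriage return, and return them without the terminator.
sub read_cr_lines {
    my ($fh) = @_;
    local $/ = "\r";      # restored automatically when the sub returns
    my @lines = <$fh>;
    chomp @lines;         # chomp strips whatever $/ currently holds
    return @lines;
}

my $data = "date,amount\r01/02,9.50\r01/03,4.25\r";
open my $fh, '<', \$data or die "Cannot read string: $!";
print "$_\n" for read_cr_lines($fh);
close $fh;
```

Printing each line with an explicit "\n" avoids the overwriting effect, because the carriage returns have been removed rather than echoed to the terminal.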

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated with a plain old double space (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
    my $inputFile = $_[0]; #read first argument from the commandline as fileName
    open INPUTFILE, "+<", $inputFile or die $!; #Open File
    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;
    while (<INPUTFILE>) { # loop through the input file line-by-line
        $singleLine = $_;
        push(@singleRecord, $singleLine); # start adding each line to a record array
        if ($singleLine =~ m/\r\n/) { # check for carriage return and new line
            $recordCounter += 1;
            createHashTable(@singleRecord); # send record make a hash table
            @singleRecord = (); # empty the current record to start a new record
        }
    }
    print "total records : $recordCounter \n";
    close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
    my ($input_file) = @_;
    open my $fh, '<:crlf', $input_file or die $!;
    local $/ = ''; # paragraph mode: one blank-line-separated record per read
    my $record_counter = 0;
    while (my $record = <$fh>) {
        chomp $record; # in paragraph mode this removes all trailing newlines
        ++$record_counter;
        create_hash_table(split /\n/, $record);
    }
    close $fh;
    print "Total records : $record_counter\n";
}
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
If your files are not huge multi-gigabytes files, the easiest and safest way is to read the whole file, and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
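Or if you only want to count the records (strictly, this counts the separators, which equals the number of records when every record, including the last, is followed by a blank line):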
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.
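The slurp-and-split approach can also be written as a small subroutine. This is a sketch under assumptions: count_records is a hypothetical name, records are taken to be separated by one or more blank lines, and chunks containing only whitespace are ignored so that a trailing blank line does not add a phantom record.

```perl
use strict;
use warnings;
use v5.10;    # \R needs Perl 5.10 or later

# Hypothetical helper: count blank-line-separated records in a string,
# whatever line-ending convention the text uses.
sub count_records {
    my ($text) = @_;
    # \R is atomic, so a single CRLF never counts as two line breaks.
    my @records = grep { /\S/ } split /\R{2,}/, $text;
    return scalar @records;
}

my $sample = "record 1\r\nsome text\r\n\r\nrecord 2\r\nsome text\r\n";
print count_records($sample), "\n";
```

Because the whole text is held in memory, this suits files of modest size, as noted above.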

CR vs LF perl parsing

I have a perl script which parses a text file and breaks it up per line into an array.
It works fine when each line is terminated by LF, but when lines are terminated by CR my script does not handle them properly.
How can I modify this line to fix this?
my @allLines = split(/^/, $entireFile);
edit:
My file has a mixture of lines ending in either LF or CR; it just collapses all the lines when they end in CR.
Perl can handle both CRLF and LF line-endings with the built-in :crlf PerlIO layer:
open(my $in, '<:crlf', $filename);
will automatically convert CRLF line endings to LF, and leave LF line endings unchanged. But CR-only files are the odd-man out. If you know that the file uses CR-only, then you can set $/ to "\r" and it will read line-by-line (but it won't change the CR to a LF).
If you have to deal with files of unknown line endings (or even mixed line endings in a single file), you might want to install the PerlIO::eol module. Then you can say:
open(my $in, '<:raw:eol(LF)', $filename);
and it will automatically convert CR, CRLF, or LF line endings into LF as you read the file.
Another option is to set $/ to undef, which will read the entire file in one slurp. Then split it on /\r\n?|\n/. But that assumes that the file is small enough to fit in memory.
If you have mixed line endings, you can normalize them by matching a generalized line ending:
use v5.10;
$entireFile =~ s/\R/\n/g;
You can also open a filehandle on a string and read lines just like you would from a file:
open my $fh, '<', \ $entireFile;
my @lines = <$fh>;
close $fh;
You can even open the string with the layers that cjm shows.
You can probably just handle the different line endings when doing the split, e.g.:
my @allLines = split(/\r\n|\r|\n/, $entireFile);
Perl will automatically split the input into lines if you read with <>, but you need to change $/ to "\r".
$/ is the "input record separator". see perldoc perlvar for details.
There is not any way to change what a regular expression considers to be the end-of-line - it's always newline.
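To tie the split-based suggestions together, here is a small sketch that handles CRLF, CR, and LF in the same string. The helper name split_lines is hypothetical; the alternation lists \r\n first so that a CRLF pair is consumed as one separator rather than two.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lines regardless of which
# line-ending convention each line uses.
sub split_lines {
    my ($text) = @_;
    return split /\r\n|\r|\n/, $text;
}

my @lines = split_lines("one\r\ntwo\rthree\nfour\n");
print scalar(@lines), " lines\n";
```

split discards trailing empty fields, so a line break at the very end of the text does not produce an extra empty element; empty lines in the middle are kept.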