Neatest way to remove linebreaks in Perl - perl

I'm maintaining a script that can get its input from various sources, and works on it per line. Depending on the actual source used, linebreaks might be Unix-style, Windows-style or even, for some aggregated input, mixed(!).
When reading from a file it goes something like this:
#lines = <IN>;
process(\#lines);
...
sub process {
#lines = shift;
foreach my $line (#{$lines}) {
chomp $line;
#Handle line by line
}
}
So, what I need to do is replace the chomp with something that removes either Unix-style or Windows-style linebreaks.
I'm coming up with way too many ways of solving this, one of the usual drawbacks of Perl :)
What's your opinion on the neatest way to chomp off generic linebreaks? What would be the most efficient?
Edit: A small clarification - the method 'process' gets a list of lines from somewhere, not nessecarily read from a file. Each line might have
No trailing linebreaks
Unix-style linebreaks
Windows-style linebreaks
Just Carriage-Return (when original data has Windows-style linebreaks and is read with $/ = '\n')
An aggregated set where lines have different styles

After digging a bit through the perlre docs a bit, I'll present my best suggestion so far that seems to work pretty good. Perl 5.10 added the \R character class as a generalized linebreak:
$line =~ s/\R//g;
It's the same as:
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.

Whenever I go through input and want to remove or replace characters I run it through little subroutines like this one.
sub clean {
my $text = shift;
$text =~ s/\n//g;
$text =~ s/\r//g;
return $text;
}
It may not be fancy but this method has been working flawless for me for years.

$line =~ s/[\r\n]+//g;

Reading perlport I'd suggest something like
$line =~ s/\015?\012?$//;
to be safe for whatever platform you're on and whatever linefeed style you may be processing because what's in \r and \n may differ through different Perl flavours.

Note from 2017: File::Slurp is not recommended due to design mistakes and unmaintained errors. Use File::Slurper or Path::Tiny instead.
extending on your answer
use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;
File::Slurp abstracts away the File IO stuff and just returns a string for you.
NOTE
Important to note the addition of /g , without it, given a multi-line string, it will only replace the first offending character.
Also, the removal of $, which is redundant for this purpose, as we want to strip all line breaks, not just line-breaks before whatever is meant by $ on this OS.
In a multi-line string, $ matches the end of the string and that would be problematic ).
Point 3 means that point 2 is made with the assumption that you'd also want to use /m otherwise '$' would be basically meaningless for anything practical in a string with >1 lines, or, doing single line processing, an OS which actually understands $ and manages to find the \R* that proceed the $
Examples
while( my $line = <$foo> ){
$line =~ $regex;
}
Given the above notation, an OS which does not understand whatever your files '\n' or '\r' delimiters, in the default scenario with the OS's default delimiter set for $/ will result in reading your whole file as one contiguous string ( unless your string has the $OS's delimiters in it, where it will delimit by that )
So in this case all of these regex are useless:
/\R*$// : Will only erase the last sequence of \R in the file
/\R*// : Will only erase the first sequence of \R in the file
/\012?\015?// : When will only erase the first 012\015 , \012 , or \015 sequence, \015\012 will result in either \012 or \015 being emitted.
/\R*$// : If there happens to be no byte sequences of '\015$OSDELIMITER' in the file, then then NO linebreaks will be removed except for the OS's own ones.
It would appear nobody gets what I'm talking about, so here is example code, that is tested to NOT remove line feeds. Run it, you'll see that it leaves the linefeeds in.
#!/usr/bin/perl
use strict;
use warnings;
my $fn = 'TestFile.txt';
my $LF = "\012";
my $CR = "\015";
my $UnixNL = $LF;
my $DOSNL = $CR . $LF;
my $MacNL = $CR;
sub generate {
my $filename = shift;
my $lineDelimiter = shift;
open my $fh, '>', $filename;
for ( 0 .. 10 )
{
print $fh "{0}";
print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
print $fh "{1}";
print $fh $lineDelimiter->();
print $fh "{2}";
}
close $fh;
}
sub parse {
my $filename = shift;
my $osDelimiter = shift;
my $message = shift;
print "Parsing $message File $filename : \n";
local $/ = $osDelimiter;
open my $fh, '<', $filename;
while ( my $line = <$fh> )
{
$line =~ s/\R*$//;
print ">|" . $line . "|<";
}
print "Done.\n\n";
}
my #all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL };
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
return #all[ int(rand(2)) ];
};
for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
for ( qw( Windows Mac Unix Mixed ) ){
parse $_ . ".txt", #{ $os };
}
}
For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74
Note there are certain combinations that of course work, but they are likely the ones you yourself naĆ­vely tested.
Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.
and $string is of the general form {0}$data{1}$delimiter{2} where in all output sources, there should be either :
Nothing between {1} and {2}
only |<>| between {1} and {2}

In your example, you can just go:
chomp(#lines);
Or:
$_=join("", #lines);
s/[\r\n]+//g;
Or:
#lines = split /[\r\n]+/, join("", #lines);
Using these directly on a file:
perl -e '$_=join("",<>); s/[\r\n]+//g; print' <a.txt |less
perl -e 'chomp(#a=<>);print #a' <a.txt |less

To extend Ted Cambron's answer above and something that hasn't been addressed here: If you remove all line breaks indiscriminately from a chunk of entered text, you will end up with paragraphs running into each other without spaces when you output that text later. This is what I use:
sub cleanLines{
my $text = shift;
$text =~ s/\r/ /; #replace \r with space
$text =~ s/\n/ /; #replace \n with space
$text =~ s/ / /g; #replace double-spaces with single space
return $text;
}
The last substitution uses the g 'greedy' modifier so it continues to find double-spaces until it replaces them all. (Effectively substituting anything more that single space)

Related

Search and replace a string in a file

I'm trying to read contents from an input file, copy only certain lines of code from the file and print in an output file.
Certain lines of code is determined by:
Code name to determine the first line (IP1_NAME or IP2_NAME)
Pattern to determine the last line (END_OF_LIST)
Input file:
IP1_NAME
/ip1name/ip1dir/ //CLIENT_NAME/ip1name/ip1dir
/ip1testname/ip1testdir/ //CLIENT_NAME/ip1testname/ip1testdir
END_OF_LIST
IP2_NAME
/ip2name/ip2dir/ //CLIENT_NAME/ip2name/ip2dir
/ip2testname/ip2testdir/ //CLIENT_NAME/ip2testname/ip2testdir
END_OF_LIST
Output file:
(If IP1_NAME is chosen and the CLIENT_NAME should be replaced by tester_ip)
/ip1name/ip1dir/ //tester_ip/ip1name/ip1dir
/ip1testname/ip1testdir/ //tester_ip/ip1testname/ip1testdir
You could use the following one-liner to pull out the lines between the two patterns:
perl -0777 -ne 'print "$1\n" while /IP1_NAME(.*?)END_OF_LIST/gs' in.txt > out.txt
Where in.txt is your input file and out.txt is the output file.
This use case is actually described in perlfaq6: Regular Expressions.
You can then modify the output file to replace CLIENT_NAME with tester_ip:
perl -pi -e 's/CLIENT_NAME/tester_ip/' y.txt
As a script instead of a one-liner, using the scalar range operator:
#/usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
process('input.txt', qr/^IP1_NAME$/, qr/^END_OF_LIST$/, 'tester_ip');
sub process {
my ($filename, $startpat, $endpat, $newip) = #_;
open my $file, '<', $filename;
while (my $line = <$file>) {
chomp $line;
if ($line =~ /$startpat/ .. $line =~ /$endpat/) {
next unless $line =~ /^\s/; # Skip the start and lines.
$line =~ s/^\s+//; # Remove indentation
$line =~ s/CLIENT_NAME/$newip/g; # Replace with desired value
say $line;
}
}
}
Running this on your sample input file produces:
/ip1name/ip1dir/ //tester_ip/ip1name/ip1dir
/ip1testname/ip1testdir/ //tester_ip/ip1testname/ip1testdir
I am assuming there is additional stuff in your input file, otherwise we would not have to jump through the hoops with these start and end markers as and we could just say
perl -ne "print if /^ /"
and that would be silly, right ;-)
So, the flipflop has potential problems as I stated in my comment. And while clever, it does not buy you that much in terms of readability or verbosement (verbocity?), since you have to test again anyway in order to not process the marker lines.
As long as there is no exclusive flip flop operator, I would go for a more robust solution.
my $in;
while (<DATA>) {
$in = 1, next if /^IP\d_NAME/;
$in = 0 if /^END_OF_LIST/;
if ( $in )
{
s/CLIENT_NAME/tester_ip/;
print;
}
}
__DATA__
cruft
IP1_NAME
/ip1name/ip1dir/ //CLIENT_NAME/ip1name/ip1dir
/ip1testname/ip1testdir/ //CLIENT_NAME/ip1testname/ip1testdir
END_OF_LIST
more
cruft
IP2_NAME
/ip2name/ip2dir/ //CLIENT_NAME/ip2name/ip2dir
/ip2testname/ip2testdir/ //CLIENT_NAME/ip2testname/ip2testdir
END_OF_LIST
Lore Ipsargh!

Replace multiple hex values

I've written the following script because I need to do some cleanup in some files. I have a specific number of hex characters that needs to be changed into another set of hex characters (ie null to space, see below). I've written the following script, my problem is that it only replaces the first occurence and nothing else.
I've tried the /g just like a regular sed pattern but it doesnt work. Is there a way to do this and replace all matches?
(The reason i havent used a $line =~ s/... is because I think its neater and more maintainable that way, and this script will need to be accessed and run on occasion by others who may need to edit the hex values to be replaced). Another reason is because i need to change from 10+ hex values to an equivalent amount, so a huge one liner would be hard to read. Thank you in advance.
#!/usr/bin/perl
use strict;
use warnings;
my $filebase = shift || "testreplace.txt";
my $filefilter = shift || "testf";
open my $fh1, '>', 'testreplaceout';
# Iterate over file and read lines
open my $file1, '<', $filebase;
while (my $line = <$file1>)
{
chomp($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print {$fh1} "$line \n";
}
/g will do what you want. If it doesn't seem to be working, add some debugging:
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
And in your loop:
print Dumper($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print Dumper($line);
Using tr with paired delimiters instead can be very readable/maintainable:
$line =~ tr[\x00\x31]
[\x20\x32];
Also, consider adding use autodie;
tr/// is probably your best bet here (since you are dealing with constant single character replacements). The following is a more generic solution.
my %replacements = (
'foo' => 'bar',
'bar' => 'baz',
);
my $pat = join '|', map quotemeta, keys(%replacement);
s/($pat)/$replacements{$1}/g;
Update: read comments for caveats of this answer.
Here's one way that'll allow you to keep your list of regex search/replaces at the top of your script nice and clean for ease of viewing and modification:
use warnings;
use strict;
my #re_list = (
['a', 'x'],
['b', 'y'],
);
while (my $line = <DATA>){
for my $re (#re_list){
$line =~ s/$re->[0]/$re->[1]/g;
}
print $line;
}
__DATA__
aaabbbccc
bbbcccddd
ababababa
Output:
xxxyyyccc
yyycccddd
xyxyxyxyx

Why is my Perl code not omitting newlines?

I'm reading this textfile to get ONLY the words in it and ignore all kind of whitespaces:
hello
now
do you see this.sadslkd.das,msdlsa but
i hoohoh
And this is my Perl code:
#!usr/bin/perl -w
require 5.004;
open F1, './text.txt';
while ($line = <F1>) {
#print $line;
#arr = split /\s+/, $line;
foreach $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
#print #arr;
}
close F1;
And this is the output:
hello
now
do
you
see
this.sadslkd.das,msdlsa
but
i
hoohoh
The output is showing two newlines but I am expecting the output to be just words. What should I do to just get words?
You should always use strict and use warnings (in preference to the -w command-line qualifier) at the top of every Perl program, and declare each variable at its first point of use using my. That way Perl will tell you about simple errors that you may otherwise overlook.
You should also use lexical file handles with the three-parameter form of open, and check the status to make sure it succeeded. There is little point in explicitly closing an input file unless you expect your program to run for an appreciable time, as Perl will close all files for you on exit.
Do you really need to require Perl v5.4? That version is fifteen years old, and if there is anything older than that installed then you have a museum!
Your program would be better like this:
use strict;
use warnings;
open my $fh, '<', './text.txt' or die $!;
while (my $line = <$fh>) {
my #arr = split /\s+/, $line;
foreach my $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
}
Note: my apologies. The warnings pragma and lexical file handles were introduced only in v5.6 so that part of my answer is irrelevant. The latest version of Perl is v5.16 and you really should upgrade
As Birei has pointed out, the problem is that, when the line has leading whitespace, there is a empty field before the first separator. Imagine if your data was comma-separated, then you would want Perl to report a leading empty field if the line started with a comma.
To extract all the non-space characters you can use a regular expression that does exactly that
my #arr = $line =~ /\S+/g;
and this can be emulated by using the default parameter for split which is a single quoted space (not a regular expression)
my #arr = $line =~ split ' ', $line;
In this case split behaves like the awk utility and discards any leading empty fields as you expected.
This is even simpler if you let Perl use the $_ variable in the read loop, as all of the parameters for split can be defaulted:
while (<F1>) {
my #arr = split;
foreach my $w (#arr) {
print "$w\n" if $w !~ /^\s+$/;
}
}
This line is the problem:
#arr=split(/\s+/,$line);
\s+ does a match just before the leading spaces. Use ' ' instead.
#arr=split(' ',$line);
I believe that in this line:
if(!($w =~ /^\s+$/))
You wanted to ask if there's nothing in this row - don't print it.
But the "+" in the REGEX actually force it to have at least 1 space.
If you change the "\s+" to "\s*", you'll see that it's working. because * is 0 occurrences or more ...

With Perl, how do I read records from a file with two possible record separators?

Here is what I am trying to do:
I want to read a text file into an array of strings. I want the string to terminate when the file reads in a certain character (mainly ; or |).
For example, the following text
Would you; please
hand me| my coat?
would be put away like this:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';
Could I get some help on something like this?
This will do it. The trick to using split while preserving the token you're splitting on is to use a zero-width lookback match: split(/(?<=[;|])/, ...).
Note: mctylr's answer (currently the top rated) isn't actually correct -- it will split fields on newlines, b/c it only works on a single line of the file at a time.
gbacon's answer using the input record separator ($/) is quite clever--it's both space and time efficient--but I don't think I'd want to see it in production code. Putting one split token in the record separator and the other in the split strikes me as a little too unobvious (you have to fight that with Perl ...) which will make it hard to maintain. I'm also not sure why he's deleting multiple newlines (which I don't think you asked for?) and why he's doing that only for the end of '|'-terminated records.
# open file for reading, die with error message if it fails
open(my $fh, '<', 'data.txt') || die $!;
# set file reading to slurp (whole file) mode (note that this affects all
# file reads in this block)
local $/ = undef;
my $string = <$fh>;
# convert all newlines into spaces, not specified but as per example output
$string =~ s/\n/ /g;
# split string on ; or |, using a zero-width lookback match (?<=) to preserve char
my (#strings) = split(/(?<=[;|])/, $string);
One way is to inject another character, like \n, whenever your special character is found, then split on the \n:
use warnings;
use strict;
use Data::Dumper;
while (<DATA>) {
chomp;
s/([;|])/$1\n/g;
my #string = split /\n/;
print Dumper(\#string);
}
__DATA__
Would you; please hand me| my coat?
Prints out:
$VAR1 = [
'Would you;',
' please hand me|',
' my coat?'
];
UPDATE: The original question posed by James showed the input text on a single line, as shown in __DATA__ above. Because the question was poorly formatted, others edited the question, breaking the 1 line into 2. Only James knows whether 1 or 2 lines was intended.
I prefer #toolic's answer because it deals with multiple separators very easily.
However, if you wanted to overly complicate things, you could always try:
#!/usr/bin/perl
use strict; use warnings;
my #contents = ('');
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
$line =~ s{$/}{ };
if ( $line =~ /^([^|;]+[|;])(.+)$/ ) {
$contents[-1] .= $1;
push #contents, $2;
}
else {
$contents[-1] .= $1;
}
}
print "[$_]\n" for #contents;
__DATA__
Would you; please
hand me| my coat?
Something along the lines of
$text = <INPUTFILE>;
#string = split(/[;!]/, $text);
should do the trick more or less.
Edit: I've changed "/;!/" to "/[;!]/".
Let Perl do half the work for you by setting $/ (the input record separator) to vertical bar, and then extract semicolon-separated fields:
#!/usr/bin/perl
use warnings;
use strict;
my #string;
*ARGV = *DATA;
$/ = "|";
while (<>) {
s/\n+$//;
s/\n/ /g;
push #string => $1 while s/^(.*;)//;
push #string => $_;
}
for (my $i = 0; $i < #string; ++$i) {
print "\$string[$i] = '$string[$i]';\n";
}
__DATA__
Would you; please
hand me| my coat?
Output:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';

How can I find the strings from one file in another file in Perl?

The script below takes function names in a text file and scans on a
folder that contains multiple c,h files. It opens those files one-by-one and
reads each line. If the match is found in any part of the files, it prints the
line number and the line that contains the match.
Everything is working fine except that the comparison is not working properly. I would be very grateful to whoever solves my problem.
#program starts:
use FileHandle;
print "ENTER THE PATH OF THE FILE THAT CONTAINS THE FUNCTIONS THAT YOU WANT TO
SEARCH: ";#getting the input file
our $input_path = <STDIN>;
$input_path =~ s/\s+$//;
open(FILE_R1,'<',"$input_path") || die "File open failed!";
print "ENTER THE PATH OF THE FUNCTION MODEL: ";#getting the folder path that
#contains multiple .c,.h files
our $model_path = <STDIN>;
$model_path =~ s/\s+$//;
our $last_dir = uc(substr ( $model_path,rindex( $model_path, "\\" ) +1 ));
our $output = $last_dir."_FUNC_file_names";
while(our $func_name_input = <FILE_R1> )#$func_name_input is the function name
#that is taken as the input
{
$func_name_input=reverse($func_name_input);
$func_name_input=substr($func_name_input,rindex($func_name_input,"\("+1);
$func_name_input=reverse($func_name_input);
$func_name_input=substr($func_name_input,index($func_name_input," ")+1);
#above 4 lines are func_name_input is choped and only part of the function
#name is taken.
opendir FUNC_MODEL,$model_path;
while (our $file = readdir(FUNC_MODEL))
{
next if($file !~ m/\.(c|h)/i);
find_func($file);
}
close(FUNC_MODEL);
}
sub find_func()
{
my $fh1 = FileHandle->new("$model_path//$file") or die "ERROR: $!";
while (!$fh1->eof())
{
my $func_name = $fh1->getline(); #getting the line
**if($func_name =~$func_name_input)**#problem here it does not take the
#match
{
next if($func_name=~m/^\s+/);
print "$.,$func_name\n";
}
}
}
$func_name_input=substr($func_name_input,rindex($func_name_input,"\("+1);
You're missing an ending parenthesis. Should be:
$func_name_input=substr($func_name_input,rindex($func_name_input,"\(")+1);
There's probably an easier way than those four statements, too. But it's a little early to wrap my head around it all. Do you want to match "foo" in "function foo() {"? If so, you could use a regex like /\s+([^) ]+)/.
When you say $func_name =~$func_name_input, you're treating all characters in $func_name_input as special regex characters. If this is not what you mean to do, you can use quotemeta (perldoc -f quotemeta): $func_name =~quotemeta($func_name_input) or $func_name =~ qr/\Q$func_name_input\E/.
Debugging will be easier with strictures (and a syntax-hilighting editor). Also note that, if you're not using those variables in other files, "our" doesn't do anything "my" wouldn't do for file-scoped variables.
find + xargs + grep does 90% of what you want.
find . -name '*.[c|h]' | xargs grep -n your_pattern
ack does it even easier.
ack --type=cc your_pattern
Simply take your list of patterns from your file and "or" them together.
ack --type=cc 'foo|bar|baz'
This has the benefit of only search the files once, and not once for each pattern being searched for as you're doing.
I still think you should just use ack, but your code needed some serious love.
Here is an improved version of your program. It now takes the directory to search and patterns on the command line rather than having to ask for (and the user write) files. It searches all the files under the directory, not just the ones in the directory, using File::Find. It does this in one pass by concatenating all the patterns into regular expressions. It uses regexes instead of index() and substr() and reverse() and oh god. It simply uses built in filehandles rather than the FileHandle module and checking for eof(). Everything is declared lexical (my) instead of global (our). Strict and warnings are on for easier debugging.
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
die "Usage: search_directory function ...\n" unless #ARGV >= 2;
my $Search_Dir = shift;
my $Pattern = build_pattern(#ARGV);
find(
{
wanted => sub {
return unless $File::Find::name =~ m/\.(c|h)$/i;
find_func($File::Find::name, $pattern);
},
no_chdir => 1,
},
$Search_Dir
);
# Join all the function names into one pattern
sub build_pattern {
my #patterns;
for my $name (#_) {
# Turn foo() into foo. This replaces all that reverse() and rindex()
# and substr() stuff.
$name =~ s{\(.*}{};
# Use \Q to protect against regex metacharacters in the input
push #patterns, qr{\Q$name\E};
}
# Join them up into one pattern.
return join "|", #patterns;
}
sub find_func {
my( $file, $pattern ) = #_;
open(my $fh, "<", $file) or die "Can't open $file: $!";
while (my $line = <$fh>) {
# XXX not all functions are unindented, but your choice
next if $line =~ m/^\s+/;
print "$file:$.: $line" if $line =~ $pattern;
}
}