Perl - remove carriage return and append next line

What if I have a record in an otherwise good file that has a carriage return in it?
Ex:
1,2,3,4,5 ^M
,6,7,8,9,10
and I wanted to make it
1,2,3,4,5,6,7,8,9,10

In general, if you have a string with a stray newline at the end that you want to get rid of, you can use chomp on it (note that you can pass it an lvalue, so wrapping it around an assignment is legal):
my $string = my $string2 = "blah\n";
chomp $string;
# this works too:
chomp(my $string3 = $string2);
Note that if the string has a trailing "\r\n", chomp won't take the \r as well, unless you modify $/.
So if all of that is too complicated, and you need to remove all occurrences of \n, \r\n and \r (maybe you're processing lines from a variety of architectures all at once?), you can fall back to good old tr:
$string =~ tr/\r\n//d;
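For comparison, here is a minimal Python sketch of the same two ideas. The function names chomp and strip_all_eols are made up for this illustration; they are not part of Perl or of any library.

```python
def chomp(line: str) -> str:
    # Drop a single trailing LF, roughly what Perl's chomp does
    # with the default $/ (a preceding CR is left alone).
    return line[:-1] if line.endswith("\n") else line


def strip_all_eols(text: str) -> str:
    # Delete every CR and LF wherever it appears, like tr/\r\n//d.
    return text.translate(str.maketrans("", "", "\r\n"))


print(repr(chomp("blah\n")))                # 'blah'
print(repr(chomp("blah\r\n")))              # 'blah\r'  (the CR survives)
print(repr(strip_all_eols("a\r\nb\rc\n")))  # 'abc'
```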

Say we have a file that contains a ctrl-M (aka \r on some platforms):
$ cat input
1,2,3
4,5,6
,7,8,9
10,11,12
This is explicit with od:
$ od -c input
0000000 1 , 2 , 3 \n 4 , 5 , 6 \r \n , 7 ,
0000020 8 , 9 \n 1 0 , 1 1 , 1 2 \n
0000035
Remove each offending character and join its line with the next by running
$ perl -pe 's/\cM\cJ?//g' input
1,2,3
4,5,6,7,8,9
10,11,12
or redirect to a new file with
$ perl -pe 's/\cM\cJ?//g' input >updated-input
or overwrite it in place (plus a backup in input.bak) with
$ perl -i.bak -pe 's/\cM\cJ?//g' input
Making the \cJ optional handles the case when a file ends with ctrl-M but not ctrl-J.
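If it helps to see the substitution outside of a one-liner, the following Python sketch applies the same pattern (a CR, optionally followed by an LF) to the sample data from above. The name join_cr_lines is only illustrative.

```python
import re


def join_cr_lines(data: str) -> str:
    # Same pattern as s/\cM\cJ?//g: remove a CR and, if present,
    # the LF right after it, so the line joins the next one.
    return re.sub(r"\r\n?", "", data)


sample = "1,2,3\n4,5,6\r\n,7,8,9\n10,11,12\n"
print(join_cr_lines(sample), end="")
```

When reading a real file in Python, open it with newline="" so that the CR characters are not translated away before the substitution sees them.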

s/[\r\n]//g
Only do this if you want to combine every line with the next: it removes all carriage returns and line feeds, not just the stray one.

Assuming the carriage return is right before the line feed:
perl -pi.bak -e 's/\r\n//' your_file_name
This will join only lines with a carriage return at the end of the line to the next line.

Every line ends with a terminator sequence, one of:
CRLF (\r\n = 13, 10) on Windows/DOS
LF (\n = 10) on Unix
CR (\r = 13) on classic Mac OS
If some lines are OK, you should say which system the file comes from, or on which system the Perl script is executed; otherwise the risk is to remove every end of line and merge all the lines of your file...
As ^M is the CR character, if you see such a character at the end of a line and nothing special on other lines, you are probably using some kind of Unix (Linux ?) and some copy/paste has polluted one line with an additional \r at the end of line.
If this is the case:
perl -pi -e 's/\r\n$//g' filetomodify
will do the trick and merge only the line containing both CR and LF with the next line, leaving the other lines untouched.
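To find out which convention a file actually uses before touching it, one could count the terminators first. This is a rough Python sketch, not something from the answer above; count_line_endings is a hypothetical helper name.

```python
import re


def count_line_endings(data: bytes) -> dict:
    # CRLF pairs, LFs not preceded by CR, and CRs not followed by LF.
    crlf = len(re.findall(rb"\r\n", data))
    lone_lf = len(re.findall(rb"(?<!\r)\n", data))
    lone_cr = len(re.findall(rb"\r(?!\n)", data))
    return {"CRLF": crlf, "LF": lone_lf, "CR": lone_cr}


sample = b"1,2,3\n4,5,6\r\n,7,8,9\n10,11,12\n"
print(count_line_endings(sample))  # {'CRLF': 1, 'LF': 3, 'CR': 0}
```

A file that is mostly LF with a single CRLF, as here, matches the "one polluted line" situation described above.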

More Information Needed
More information is needed about the underlying data and what your definition of carriage return is. Is the data in Linux or Windows? Really, do you mean carriage return/line feed, or just line feed?
Some Options:
$text =~ tr/\r//d; → this is the fastest method to weed out carriage returns (without the d modifier, tr only counts the characters)
$text =~ tr/\n//d; → this is the fastest method to remove newlines
$text =~ s/\n//g; → this is probably what you're looking for, which makes the text appear as one line and removes the internal \n

Related

Unable to remove carriage returns and line feeds in columns enclosed in double quotes [duplicate]

I want to remove any non-printable newline characters in the column data.
I have enclosed all the columns in double quotes so that newline characters inside a column are easy to delete and can be told apart from the record delimiter at the end of each line.
Say I have 4 columns separated by commas and enclosed in quotes in a text file.
I'm trying to remove \n and \r characters only if they are present between the double quotes.
Currently I use tr, but it deletes every line break and produces one long line without any record separator.
tr -d '\n\r' < in.txt > out.txt
Sample data:
"1","test\n
Sample","data","col4"\n
"2\n
","Test","Sample","data" \n
"3","Sam\n
ple","te\n
st","data"\n
Expected Output:
"1","testSample","data","col4"\n
"2","Test","Sample","data" \n
"3","Sample","test","data"\n
Any suggestions ? Thanks in advance
With GNU sed
sed ':a;N;$!ba;s/\("[^\n\r]*\)[\n\r]*\([^\n\r]*\"\)/\1\2/g' file
See this post for the newline replacement without the enclosing ".
Could you please try awk solution and let me know if this helps you.
awk '{gsub(/\r/,"");printf("%s%s",$0,$0~/,$/?"":RS)}' Input_file
Output will be as follows.
"1","test","Sample","data"\n
"2","Test" \n
"3","Sample"
Explanation: printf prints each line using two %s string placeholders. The first %s simply prints the current line; the second prints nothing if the line ends with a comma (,) and a newline (RS) otherwise. Add gsub(/\r/,"") before printf in case you want to remove carriage returns and get the expected output you showed.
EDIT: As your post title suggests to remove carriage returns, so in case you want to remove carriage returns then you could try following. Though you should be mentioning your problem clearly.
tr -d '\r' < Input_file > temp_file && mv temp_file Input_file
The above will remove the carriage return characters from your Input_file and save the result back to the same Input_file.
Here's a possible solution:
perl -pe 'if (tr/"// % 2) { chomp; $_ .= <>; redo; }'
If the current line has unbalanced quotes (i.e. an odd number of "), it must end in the middle of a field, so we chomp out the newline, append the next input line, and restart the loop.
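The same quote-counting idea can be sketched in Python as follows. The generator name join_quoted_fields is made up for illustration. Unlike the one-liner, this version also strips a trailing CR, and it stops cleanly if the input ends inside a quoted field.

```python
def join_quoted_fields(lines):
    # Yield records, merging physical lines until the quotes balance.
    buffer = ""
    for line in lines:
        buffer += line
        if buffer.count('"') % 2:
            # Odd number of quotes: the record continues on the next
            # line, so drop the line break and keep accumulating.
            buffer = buffer.rstrip("\r\n")
            continue
        yield buffer
        buffer = ""
    if buffer:
        # Input ended inside a quoted field; emit what we have.
        yield buffer


rows = [
    '"1","test\n', 'Sample","data","col4"\n',
    '"2\n', '","Test","Sample","data"\n',
    '"3","Sam\n', 'ple","te\n', 'st","data"\n',
]
for record in join_quoted_fields(rows):
    print(record, end="")
```

Like the Perl version, this only counts quote characters, so it is a heuristic rather than a real CSV parser.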

put all separate paragraphs of a file into a separate line

I have a file that contains sequence data, where each new paragraph (separated by two blank lines) contains a new sequence:
#example
ASDHJDJJDMFFMF
AKAKJSJSJSL---
SMSM-....SKSKK
....SK
SKJHDDSNLDJSCC
AK..SJSJSL--HG
AHSM---..SKSKK
-.-GHH
and I want to end up with a file looking like:
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
each sequence is the same length (if that helps).
I would also like to do this over multiple files stored in different directories.
I have just tried
sed -e '/./{H;$!d;}' -e 'x;/regex/!d' ./text.txt
however this just deleted the entire file :S
Any help would be appreciated - it doesn't have to be in sed; if you know how to do it in perl or something else then that's also great.
Thanks.
All you're asking to do is convert a file of blank-lines-separated records (RS) where each field is separated by newlines into a file of newline-separated records where each field is separated by nothing (OFS). Just set the appropriate awk variables and recompile the record:
$ awk '{$1=$1}1' RS= OFS= file
ASDHJDJJDMFFMFAKAKJSJSJSL---SMSM-....SKSKK....SK
SKJHDDSNLDJSCCAK..SJSJSL--HGAHSM---..SKSKK-.-GHH
awk '
/^[[:space:]]*$/ {if (line) print line; line=""; next}
{line=line $0}
END {if (line) print line}
'
perl -00 -pe 's/\n//g; $_.="\n"'
For multiple files:
# adjust your glob pattern to suit,
# don't be shy to ask for assistance
for file in */*.txt; do
newfile="/some/directory/$(basename "$file")"
perl -00 -pe 's/\n//g; $_.="\n"' "$file" > "$newfile"
done
A Perl one-liner, if you prefer:
perl -nle 'BEGIN{$/=""};s/\n//g;print $_' file
The $/ variable is the equivalent of awk's RS variable. When set to the empty string ("") it causes two or more empty lines to be treated as one empty line. This is the so-called "paragraph mode" of reading. For each record read, all newline characters are removed. The -l switch adds a newline to the end of each output string, thus giving the desired result.
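As a rough illustration of what paragraph mode does, here is a Python sketch that splits on runs of empty lines and then removes the newlines inside each paragraph. paragraphs_to_lines is a hypothetical name, and the sketch assumes the separating lines are truly empty (no spaces or tabs).

```python
import re


def paragraphs_to_lines(text: str) -> str:
    # Split on runs of empty lines (paragraph mode), then remove
    # the newlines inside each paragraph and end it with one newline.
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    return "".join(block.replace("\n", "") + "\n" for block in blocks)


data = (
    "ASDHJDJJDMFFMF\nAKAKJSJSJSL---\nSMSM-....SKSKK\n....SK\n"
    "\n\n"
    "SKJHDDSNLDJSCC\nAK..SJSJSL--HG\nAHSM---..SKSKK\n-.-GHH\n"
)
print(paragraphs_to_lines(data), end="")
```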
Try to find the double line breaks first (\n\n, or \r\n\r\n) and replace them with a special marker like :$:
After that, replace every remaining line break with an empty string to get the whole file on one line.
Finally, replace your special marker with a single line break :)

sed: joining lines depending on the second one

I have a file that, occasionally, has split lines. The split is signaled by the fact that the line starts with '+' (possibly preceded by spaces).
line 1
line 2
+ continue 2
line 3
...
I'd like to join the split line back:
line 1
line 2 continue 2
line 3
...
using sed. I'm not clear how to join a line with the preceding one.
Any suggestion?
This might work for you:
sed 'N;s/\n\s*+//;P;D' file
These are actually four commands:
N append the next line of input to the pattern space
s/\n\s*+// remove the newline, any following whitespace and the plus
P print the pattern space up to the first newline
D delete the pattern space up to the first newline, i.e. the part which was just printed
The relevant manual page parts are
Selecting lines by numbers
Addresses overview
Multiline techniques - using D,G,H,N,P to process multiple lines
Doing this in sed is certainly a good exercise, but it's pretty trivial in perl:
perl -0777 -pe 's/\n\s*\+//g' input
I'm not partial to sed so this was a nice challenge for me.
sed -n '1{h;n};/^ *+ */{s// /;H;n};{x;s/\n//g;p};${x;p}'
In awk this is approximately:
awk '
NR == 1 {hold = $0; next}
/^ *\+/ {$1 = ""; hold=hold $0; next}
{print hold; hold = $0}
END {if (hold) print hold}
'
If the last line is a "+" line, the sed version will print a trailing blank line. Couldn't figure out how to suppress it.
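For what it's worth, the same "hold the previous line" logic can be sketched in Python, where the end-of-input case is easy to handle without an extra blank line. The names join_continuations and CONTINUATION are made up for this example.

```python
import re

# A continuation line: optional whitespace followed by a plus sign.
CONTINUATION = re.compile(r"^\s*\+")


def join_continuations(lines):
    hold = None
    for line in lines:
        line = line.rstrip("\n")
        if hold is not None and CONTINUATION.match(line):
            # Drop the leading whitespace and "+", append the rest.
            hold += CONTINUATION.sub("", line, count=1)
        else:
            if hold is not None:
                yield hold
            hold = line
    if hold is not None:
        # Flush the last held line exactly once.
        yield hold


for out in join_continuations(["line 1\n", "line 2\n", "+ continue 2\n", "line 3\n"]):
    print(out)
```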
You can use Vim in Ex mode:
ex -sc g/+/-j -cx file
g global search
- select previous line
j join with next line
x save and close
Different use of the hold space with sed: load the entire file into the hold space before merging lines. (Note that \s is a GNU extension; a strictly POSIX sed would need [[:space:]] instead.)
sed -n '1x;1!H;${g;s/\n\s*+//g;p}'
1x on the first line, swap the line into the empty hold space
1!H on non-first lines, append to the hold space
$ on the last line:
g get the hold space (the entire file)
s/\n\s*+//g remove each newline (and any whitespace) preceding a +
p print everything
Input:
line 1
line 2
+ continue 2
+ continue 2 even more
line 3
+ continued
becomes
line 1
line 2 continue 2 continue 2 even more
line 3 continued
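The whole-file substitution is simple enough to try out in Python as well. This is only a sketch of the same regex applied to a string that has been read in one go; join_plus_lines is an illustrative name.

```python
import re


def join_plus_lines(text: str) -> str:
    # Remove each newline that is followed by optional whitespace
    # and a plus sign, which glues the line onto the previous one.
    return re.sub(r"\n\s*\+", "", text)


text = (
    "line 1\n"
    "line 2\n"
    "+ continue 2\n"
    "+ continue 2 even more\n"
    "line 3\n"
    "+ continued\n"
)
print(join_plus_lines(text), end="")
```

Because \s also matches newlines, blank lines directly before a + line are swallowed too, which may or may not be what you want.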
This (or potong's answer) might be more interesting than a sed -z implementation if you want other commands for other manipulations of the data: you can simply stick them in before 1!H. sed -z, by contrast, immediately loads the entire file into the pattern space, which means you aren't manipulating single lines at any point. The same goes for perl -0777.
In other words, if you want to also eliminate comment lines starting with *, add in /^\s*\*/d to delete the line
sed -n '1x;/^\s*\*/d;1!H;${g;s/\n\s*+//g;p}'
versus:
sed -z 's/\n\s*+//g;s/\n\s*\*[^\n]*\n/\n/g'
The former's accumulation in the hold space line by line keeps you in classic sed line processing land, while the latter's sed -z dumps you into what could be some painful substring regexes.
But that's sort of an edge case, and you could always just pipe sed -z back into sed. So +1 for that.
Footnote for internet searches: This is SPICE netlist syntax.
A solution for versions of sed that can read NUL-separated data, such as GNU sed with -z:
sed -z 's/\n\s*+//g'
Compared to potong's solution this has the advantage of being able to join multiple lines that start with +. For example:
line 1
line 2
+ continue 2
+ continue 2 even more
line 3
becomes
line 1
line 2 continue 2 continue 2 even more
line 3

How to remove carriage returns in the middle of a line

I have a file that is read by an application on Unix and Windows. However, I am encountering problems when reading it on Windows because of ^M in the middle of the data. I only want to remove the ^M in the middle of lines, such as in field 4 and field 5.
I have tried using perl -pe 's/\cM\cJ?//g' but it collapses everything into one line, which I don't want. I want the data to stay on the same lines, with only the extra ^M characters removed.
# Comment^M
# field1_header|field2_header|field3_header|field4_header|field5_header|field6_header^M
#^M
field1|field2|field3|fie^Mld4|fiel^Md5|field6^M
^M
Thanks
To just remove CR in the middle of a line:
perl -pe 's/\r(?!\n)//g'
You can also write this perl -pe 's/\cM(?!\cJ)//g'. The ?! construct is a negative look-ahead expression. The pattern matches a CR, but only when it is not followed by a LF.
Of course, if producing a file with unix newlines is acceptable, you can simply strip all CR characters:
perl -pe 'tr/\015//d'
What you wrote, s/\cM\cJ?//g, strips a CR and the LF after it if there is one, because the LF is part of the matched pattern.
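To see the difference between the two patterns on the sample record, here is a small Python sketch. The function names are hypothetical, and the record string stands in for one line of the file with real CR characters.

```python
import re


def strip_inner_cr(data: str) -> str:
    # Negative look-ahead: remove a CR only when no LF follows it.
    return re.sub(r"\r(?!\n)", "", data)


def strip_cr_and_lf(data: str) -> str:
    # The original attempt: the optional LF is part of the match,
    # so line endings disappear along with the stray CRs.
    return re.sub(r"\r\n?", "", data)


record = "field1|field2|field3|fie\rld4|fiel\rd5|field6\r\n"
print(repr(strip_inner_cr(record)))   # 'field1|field2|field3|field4|field5|field6\r\n'
print(repr(strip_cr_and_lf(record)))  # 'field1|field2|field3|field4|field5|field6'
```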
Sounds like the easiest solution might be to check your filetype before moving between unix and windows. dos2unix and unix2dos might be what you really need, instead of a regex.
I'm not sure what character ^M is supposed to be, but carriage return is \015 or \r. So, s/\r//g should suffice. Remember it also removes your last carriage return, if that is something you wish to preserve.
use strict;
use warnings;
my $a = "field1|field2|field3|fie^Mld4|fiel^Md5|field6^M";
$a =~ s/\^M(?!$)//g;
print $a;

What's the clearest way to replace trailing backslash \ with \n?

I want multi-line strings in java, so I seek a simple preprocessor to convert C-style multi-lines into single lines with a literal '\n'.
Before:
System.out.println("convert trailing backslashes\
this is on another line\
\
\
above are two blank lines\
But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
After:
System.out.println("convert trailing backslashes\nthis is on another line\n\n\nabove are two blank lines\nBut don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
I thought sed would do it well, but sed is line-based, so replacing the '\' and the newline that follows it (effectively joining the two lines) is not very natural in sed. I adapted sredden79's oneliner to the following - it works, it's clever, but it's not clear:
sed ':a { $!N; s/\\\n/\\n/; ta }'
The substitute is of escaped literal backslash, newline with escaped literal backslash, n. :a is a label and ta is goto label if the substitute found a match; $ means the last line, and $! is the opposite (i.e. all lines but the last). N means to append the next line to the pattern space (thus making the \n character visible.)
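If sed is not a requirement, the same replacement is a plain string substitution once the whole file is in memory. A minimal Python sketch, with fold_continuations as a made-up name and a shortened sample standing in for the Java source:

```python
def fold_continuations(source: str) -> str:
    # Replace each backslash-newline pair with the two characters
    # backslash and "n", joining the line with the one after it.
    return source.replace("\\\n", "\\n")


# Shortened stand-in for the Java snippet: three trailing backslashes
# and one non-trailing escape (\") that must be left alone.
before = 'println("first\\\nsecond\\\n\\\nlast \\"quoted\\"");\n'
print(fold_continuations(before), end="")
```

As with the sed version, a line that ends in an escaped backslash (\\) would also be folded, so this is a sketch for the simple case only.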
EDIT here's a variation to keep compiler error line numbers etc accurate: it turns each extended line into "..."+\n (and handles the first and last lines of the String correctly):
sed ':a { $!N; s/\\\n/\\n"+\n"/; ta }'
giving:
System.out.println("convert trailing backslashes\n"+
"this is on another line\n"+
"\n"+
"\n"+
"above are two blank lines\n"+
"But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
EDIT Actually, it would be better to have Perl/Python style multi-line strings, where the string starts and ends with a special code on one line (""" for python, I think).
Is there a simpler, saner, clearer way (maybe not using sed)?
Is there a simpler, saner, clearer way.
Forget the pre-processor, live with the limitation, complain about it (so that it will maybe be fixed in Java 7 or 8), and use an IDE to ease the pain.
Other alternatives (too troublesome I suppose, but still better than messing with the compilation process):
use a JVM-based language that does support here-docs
externalize the string into a resource file
A perl one-liner:
perl -0777 -pe 's/\\\n/\\n/g'
This will read either stdin or the file(s) named after it on the command line and write the output to stdout.
If you're using an editor that supports filtering, like vi or emacs, just filter your text through the above command and you're done:
If you're using Windows and have to worry about \r :
C:\> perl -0777 -pe "s/\\\r?\n/\\n/g"
although I think win32 Perl handles \r itself so this may be unnecessary.
The -0777 option is a special case of the -0 (that's a zero) option that defines the line or record separator. In this case, it means that we don't want any separator so read the entire file in as a single string.
The -pe option is a combination of -p (process line-by-line and print the result) and -e (next argument is (a line of) the program to execute)
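A perl script to do what you asked for.
while (<>) {
    chomp;
    print $_;
    if (/\\$/) {
        # Line ends with a backslash: printing "n" completes a literal \n
        # and joins the next line onto this one.
        print "n";
    } else {
        print "\n";
    }
}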
sed 's/\x5c\x5c$/\x22\x5c\x5cn\x22/'
Hex for backslash and double quote is \x5c and \x22 respectively - it needs to be escaped so \x5c is doubled and the $ anchors to the end of the line.
Updated again per OP comment:
sed "{:a;N;\$!b a};s/\x5c\x5c\n/\x5c\x5cn/g"
The :a creates a label and the N appends the next line to the pattern space; the b a branches back to the label :a except when it's the last line ($!).
After it's all loaded, a single substitution replaces every backslash followed by a newline with a literal '\n', using the hex ASCII code \x5c for the backslash.