Perl - unknown end of line character - perl

I want to read an input file line by line, but this file has unknown ending character.
Editor vim does not know it either, it represents this character as ^A and immediately starts with characters from new line. The same is for perl. It tried to load all lines in once, because it ignores these strange end of line character.
How can I set this character as end of line for perl? I don't want to use any special module for it (because of our strict system), I just want to define the character (maybe in hex code) of end of line.
The another option is to convert the file to another one, with good end of line character (replace them). Can I make it in some easy way (something like sed on input file)? But everything need to be done in perl.
It is possible?
Now, my reading part looks like:
open (IN, $in_file);
$event=<IN>; # read one line

The ^A character you mention is the "start of heading" character. You can set the special Perl variable $/ to this character. Although, if you want your code to be readable and editable by the guy who comes after you (and uses another editor), I would do something like this:
use English;
local $INPUT_RECORD_SEPARATOR = "\cA" # 'start of heading' character
while (<>)
{
chomp; # remove the unwanted 'start of heading' character
print $_ . "\n";
}
From Perldoc:
$INPUT_RECORD_SEPARATOR
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is.
More on special character escaping on PerlMonks.
Oh and if you want, you can enter the "start of heading" character in VI, in insert mode, by pressing CTRL+V, then CTRL+A.
edit: added local per Drt's suggestion

Related

What does \x do in print

I would like to start by saying that I am not familiar with Perl. That being said, I came across this piece of code and I could not figure out what the \x was for in the code below. In addition, I was unsure why nothing was displayed when I ran the following:
perl -e 'print "\x7c\x8e\x04\x08"'
It's not about print: it's about string representation, in which codes represent characters from your character set. For more information you should read Quote and Quote-like Operators and Effects of Character Semantics
In your case the character code is in hex. You should look in your character set table, and you may need to convert to decimal first.
You said "I was unsure why nothing was displayed when I ran the following:"
perl -e 'print "\x7c\x8e\x04\x08"'
That command outputs 4 characters to STDOUT. Each of the characters is specified in hexadecimal. The "\x7c" part will output the vertical bar character |. The other three characters are control characters, so probably wouldn't produce any visible output. If you redirect output to a file, you will end up with a 4 byte file.
It's possible that you're not seeing the vertical bar character because it's being overwritten by your command prompt. Unlike the shell echo or Python's print, Perl's print function does not automatically append a newline to all output. If you want new lines, you can insert them in the string using \n.
\x signifies the start of a hexadecimal character notation.

perl multiline find and replace, can't get newline working

I have a directory on a Linux host with several property files which I want to edit by replacing hardcoded values with placeholder tags. My goal was to have a perl script which reads a delimited file that contains entries for each of the property files listing the hardcoded value, the placeholder value and the name of the file to edit.
For example, in file.prop I have these values set
<connection targetHostUrl="99.99.99.99"
targetHostPort="9999"
And I want to replace the values with tags as shown below
<connection targetHostUrl="TARGETHOST"
targetHostPort="PORT"
There will be several entries similar to this so I have to match on the unique combination of IP and PORT so I need a multiline match.
To do this I wrote the following script to take the input of the delimited filename, which is delimited with ||. I go get that file from the config directory and read in the values to get the hardcoded value, tag, and filename to edit. Then I read in that property file, do the substitution and then write it out again.
#!/usr/bin/perl
my $config = $ARGV[0];
chomp $config;
my $filename = '/config/' . $config;
my ($hard,$tagg,$prop);
open(DATAFILE, $filename) or die "Could not open DATAFILE $filename.";
while(<DATAFILE>)
{
chomp $_;
($hard,$tagg,$prop) = split('\|\|', $_);
$*=1;
open(INPUT,"</properties/$prop") or die "Could not open INPUT $prop.";
#input_array=<INPUT>;
close(INPUT);
$input_scalar=join("",#input_array);
$input_scalar =~ s/$hard/$tagg/;
open(OUTPUT,">/properties/$prop") or die "Could not open OUTPUT $prop.";
print(OUTPUT $input_scalar);
close(OUTPUT);
}
close DATAFILE;
Inside the config file I have the following entry
<connection targetHostUrl="99.99.999.99"(.|\n)*?targetHostPort="9999"||<connection targetHostUrl="TARGETHOST1"\n targetHostPort="PORT"||file.prop
My output is as shown below. It puts what I hoped to be a newline as a literal \n
<connection targetHostUrl="TARGETHOST"\n targetHostPort="PORT"
I can't find a way to get the \n taken as a newline. At first I thought, no problem, I'll just do a 2nd substitution like
perl -i -pe 's/\\n/\n/o' $prop
and although this works, for some reason it puts ^M characters at the end of every line except the one I did the replacement on. I don't want to do a 3rd replace to strip them out.
I've searched and found other ways of doing the multiline search/replace but they all interpret the \n literally.
Any suggestions?
My output is as shown below. It puts what I hoped to be a newline as a literal \n
Why would it insert a newline when the string doesn't contain one?
I can't find a way to get the \n taken as a newline.
There isn't any. If you want to substitute a newline, you need to provide a newline.
If you used a proper CSV parser like Text::CSV_XS, you could put a newline in your data file.
Otherwise, you'll have to write some code to handle the escape sequences you want your code to handle.
for some reason it puts ^M characters at the end of every line except the one I did the replacement on.
Quite the opposite. It removes it from the one line you did the replacement on.
That's home some programs represent a Carriage Return. You have a file with CR LF line ends. You could use dos2unix to convert it, or you could leave it as is because XML doesn't care.

Decipher this sed one-liner

I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.
I'll note that this script does not appear to work with my copy of sed (GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C it works fine.
Here's an annotated version of the script. sed basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed;
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d).
You can get it to work independently of locale by changing [ -~] to [[:print:]]
The code doesn't work for me, perhaps due to some locale setting, but this does:
vvv
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
^^^
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while ($my pattern = <>) {
chomp $pattern;
$pattern = "$pattern\n$hold"; # G
$pattern =~ s/(\n)/$1$1/; # s/\n/&&/
if ($pattern =~ /^([^\n]*\n).*\n\1/) { # /…/
next; # d
}
$pattern =~ s/\n//; # s/\n//
$hold = $pattern; # h
$pattern =~ /^([^\n]*\n?)/; print $1; # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceeded it.
s/\n/&&/: Turn the newline which separates them into a double newline, so that we can match subsequent and non-subsequent duplicates the same, see the next step.
^\([^\n]*\n\).*\n\1/: Look through the current text for the following: at the beginning of all the lines (^) look for a first line including trailing newline (\([^\n]*\n\)), then anything (.*), then a newline (\n), and then that same first line including newline repeated again (\1). If two subsequent lines are the same, then the .* in the regular expression will match the empty string, but the two \n will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.
d: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.
s/\n//: Otherwise, we continue and next turn the double newline back into a single newline.
h: We include the current line in our list of all unique lines.
P: And finally print this new unique line, up to the newline character.
For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short _[$0] is a counter (of appearance) for each unique line, for any line ($0) appearing for the second time _[$0] >= 1, thus !_[$0] evaluates to false, causing it not to be printed except its first time appearance.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)

How to remove carriage returns in the middle of a line

I have file that is read by application in unix and windows. However I am encountering problems when reading in windows with ^M in the middle of the data. I am only wanting to remove the ^M in the middle of the lines such as field 4 and field 5.
I have tried using perl -pe 's/\cM\cJ?//g' but it removes everything into one line which i don't want. I want the data to stay in the same line but remove the extra ones
# Comment^M
# field1_header|field2_header|field3_header|field4_header|field5_header|field6_header^M
#^M
field1|field2|field3|fie^Mld4|fiel^Md5|field6^M
^M
Thanks
To just remove CR in the middle of a line:
perl -pe 's/\r(?!\n)//g'
You can also write this perl -pe 's/\cM(?!\cJ)//g'. The ?! construct is a negative look-ahead expression. The pattern matches a CR, but only when it is not followed by a LF.
Of course, if producing a file with unix newlines is acceptable, you can simply strip all CR characters:
perl -pe 'tr/\015//d'
What you wrote, s/\cM\cJ?//g, strips a CR and the LF after it if there is one, because the LF is part of the matched pattern.
Sounds like the easiest solution might be to check your filetype before moving between unix and windows. dos2unix and unix2dos might be what you really need, instead of a regex.
I'm not sure what character ^M is supposed to be, but carriage return is \015 or \r. So, s/\r//g should suffice. Remember it also removes your last carriage return, if that is something you wish to preserve.
use strict;
use warnings;
my $a = "field1|field2|field3|fie^Mld4|fiel^Md5|field6^M";
$a =~ s/\^M(?!$)//g;
print $a;

Reading a large file in perl, record by record, with a dynamic record separator

I have a script that reads a large file line by line. The record separator ($/) that I would like to use is (\n). The only problem is that the data on each line contains CRLF characters (\r\n), which the program should not be considered the end of a line.
For example, here is a sample data file (with the newlines and CRLFs written out):
line1contents\n
line2contents\n
line3\r\ncontents\n
line4contents\n
If I set $/ = "\n", then it splits the third line into two lines. Ideally, I could just set $/ to a regex that matches \n and not \r\n, but I don't think that's possible. Another possibility is to read in the whole file, then use the split function to split on said regex. The only problem is that the file is too large to load into memory.
Any suggestions?
For this particular task, it sounds pretty straightforward to check your line ending and append the next line as necessary:
$/ = "\n";
...
while(<$input>) {
while( substr($_,-2) eq "\r\n" ) {
$_ .= <$input>;
}
...
}
This is the same logic used to support line continuation in a number of different programming contexts.
You are right that you can't set $/ to a regular expression.
dos2unix would put a UNIX newline character in for the "\r\n" and so wouldn't really solve the problem. I would use a regex that replaces all instances of "\r\n" with a space or tab character and save the results to a different file (since you don't want to split the line at those points). Then I would run your script on the new file.
Try using dos2unix on the file first, and then read in as normal.