What does \x do in print - perl

I would like to start by saying that I am not familiar with Perl. That being said, I came across this piece of code and I could not figure out what the \x was for in the code below. In addition, I was unsure why nothing was displayed when I ran the following:
perl -e 'print "\x7c\x8e\x04\x08"'

It's not about print: it's about string representation, in which codes represent characters from your character set. For more information you should read Quote and Quote-like Operators and Effects of Character Semantics
In your case the character code is in hex. You should look in your character set table, and you may need to convert to decimal first.

You said "I was unsure why nothing was displayed when I ran the following:"
perl -e 'print "\x7c\x8e\x04\x08"'
That command outputs 4 characters to STDOUT. Each of the characters is specified in hexadecimal. The "\x7c" part will output the vertical bar character |. The other three characters are control characters, so probably wouldn't produce any visible output. If you redirect output to a file, you will end up with a 4 byte file.
It's possible that you're not seeing the vertical bar character because it's being overwritten by your command prompt. Unlike the shell echo or Python's print, Perl's print function does not automatically append a newline to all output. If you want new lines, you can insert them in the string using \n.

\x signifies the start of a hexadecimal character notation.

Related

Issue matching Chinese characters in Perl one liner using \p{script=Han}

I'm really stumped by trying to match Chinese characters using a Perl one liner in zsh. I canot get \p{script=Han} to match Chinese characters, but \P{script=Han} does.
Task:
I need to change this:
一
<lb/> 二
to this:
<tag ref="一二">一
<lb/> 二</tag>
There could be a variable number of tags, newlines, whitespaces, tabs, alphanumeric characters, digits, etc. between the two Chinese characters. I believe the most efficient and robust way to do this would be to look for something that is *not a Chinese character.
My attempted solution:
perl -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$2<\/tag>/g'
This has the desired effect when applied to the example above.
Problem:
The issue I am having is that \P{script=Han} (or \p{^script=Han}) matches Chinese characters as well.
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters. When trying to match \P{script=Han}, the regex matches every character in the file.
I don't know why.
This is a problem because in the case of this situation, the output is not as desired:
一
<lb/> 三二
becomes
<tag ref="一二">一
<lb/> 三二</tag>
I don't want this to be matched at all- just instances where 一 and 二 are separated only by characters that are not Chinese characters.
Can anyone tell me what I'm doing wrong? Or suggest a workaround? Thanks!
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters.
The problem is that both your script and your input file are UTF-8 encoded, but you do not say so to perl. If you do not tell perl, it will assume that they are ASCII encoded.
To say that your script is UTF-8 encoded, use the utf8 pragma. To tell perl that all files you open are UTF-8 encoded, use the -CD command line option. So the following oneliner should solve your problem:
perl -Mutf8 -CD -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$2<\/tag>/g' file

I have two problems converting data using Perl?

I want to convert this input {111,222},{333,444},{555,666} so that the output be like this 111;222;333;444;555;666. The same thing goes for {111,222},{333,444} to 111;222;333;444 and {111,222} to 111;222.
I want to convert this date format from 2019-11-12 15:25:19 to 11/12/2019 15:25:19 and 2019-11-04T15:26:11.000+00:00 to 11/12/2019 15:25:19.
You could use a simple script like this:
#!/usr/bin/perl -pl
s/^\{//;
s/\}$//;
s/\}?,\{?/;/g;
The shebang line says read input lines and print the modified form of each one (-p) and handle line endings automatically (-l).
The first s/// removes a leading {. The second removes a trailing }. The last replaces },{ and , with ; across the line.
You could use a simple script like this (assuming that you really want 2019-11-04T15:26:11.000+00:00 mapped to 11/04/2019 15:26:11 and not the other value which is 8 days bar 52 seconds after the input value):
#!/usr/bin/perl -pl
s%(\d{4})-(\d\d)-(\d\d)[T ](\d\d:\d\d:\d\d).*%$2/$3/$1 $4%;
The shebang line is the same as before. The regex looks for 4 digits, dash, 2 digits, dash, 2 digits, blank or T, an hh:mm:ss string and stray junk and reformats it in the correct sequence.
Between them, they produce the desired results.
Of course, this being Perl, TMTOWTDI — timtoady, or "there's more than one way to do it":
$ cat data
{111,222},{333,444},{555,666}
{111,222},{333,444}
{111,222}
$ cat data.2
2019-11-12 15:25:19
2019-11-04T15:26:11.000+00:00
$ perl -pl -e 's/^\{//; s/\}$//; s/\}?,\{?/;/g; s%(\d{4})-(\d\d)-(\d\d)[T ](\d\d:\d\d:\d\d).*%$2/$3/$1 $4%;' data data.2
111;222;333;444;555;666
111;222;333;444
111;222
11/12/2019 15:25:19
11/04/2019 15:26:11
$
Just transliterating sequences of non-digits to ;:
$data =~ y/0-9/;/cs;

Why does perls length function return different results for same input?

I'm puzzled as to how this perl JAPH works:
perl -e 'print chr(length($_)/3) for #ARGV' Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​! Perl​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​​!
Why does length return a different output for what looks like the same argument? (Perl!)
Copy and paste this command into a text editor(in my case, it's vim) you'll find there's a lot of <200b> character between "Perl" and "!".
It's called 'ZERO WIDTH SPACE', and it's UNICODE character. Unicode Character 'ZERO WIDTH SPACE'
These are actually all different strings, all of which contain invisible characters. You can see it if you open the text in a hex editor:

What's the clearest way to replace trailing backslash \ with \n?

I want multi-line strings in java, so I seek a simple preprocessor to convert C-style multi-lines into single lines with a literal '\n'.
Before:
System.out.println("convert trailing backslashes\
this is on another line\
\
\
above are two blank lines\
But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
After:
System.out.println("convert trailing backslashes\nthis is on another line\n\n\nabove are two blank lines\nBut don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
I thought sed would do it well, but sed is line-based, so replacing the '\' and the newline that follows it (effectively joining the two lines) is not very natural in sed. I adapted sredden79's oneliner to the following - it works, it's clever, but it's not clear:
sed ':a { $!N; s/\\\n/\\n/; ta }'
The substitute is of escaped literal backslash, newline with escaped literal backslash, n. :a is a label and ta is goto label if the substitute found a match; $ means the last line, and $! is the opposite (i.e. all lines but the last). N means to append the next line to the pattern space (thus making the \n character visible.)
EDIT here's a variation to keep compiler error line numbers etc accurate: it turns each extended line into "..."+\n (and handles the first and last lines of the String correctly):
sed ':a { $!N; s/\\\n/\\n"+\n"/; ta }'
giving:
System.out.println("convert trailing backslashes\n"+
"this is on another line\n"+
"\n"+
"\n"+
"above are two blank lines\n"+
"But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
EDIT Actually, it would be better have Perl/Python style multi-line, where it starts and ends with a special code on one line (""" for python, I think).
Is there a simpler, saner, clearer way (maybe not using sed)?
Is there a simpler, saner, clearer way.
Forget the pre-processor, live with the limitation, complain about it (so that it will maybe be fixed in Java 7 or 8), and use an IDE to ease the pain.
Other alternatives (too troublesome I suppose, but still better than messing with the compilation process):
use a JVM-based language that does support here-docs
externalize the string into a resource file
A perl one-liner:
perl -0777 -pe 's/\\\n/\\n/g'
This will read either stdin or the file(s) named after it on the command line and write the output to stdout.
If you're using an editor that supports filtering, like vi or emacs, just filter your text through the above command and you're done:
If you're using Windows and have to worry about \r :
C:\> perl -0777 -pe "s/\\\r?\n/\\n/g"
although I think win32 Perl handles \r itself so this may be unnecessary.
The -0777 option is a special case of the -0 (that's a zero) option that defines the line or record separator. In this case, it means that we don't want any separator so read the entire file in as a single string.
The -pe option is a combination of -p (process line-by-line and print the result) and -e (next argument is (a line of) the program to execute)
A perl script to what you asked for.
while (<>) {
chomp;
print $_;
if (/\\$/) {
print "n";
} else {
print "\n";
}
}
sed 's/\x5c\x5c$/\x22\x5c\x5cn\x22/'
Hex for backslash and double quote is \x5c and \x22 respectively - it needs to be escaped so \x5c is doubled and the $ anchors to the end of the line.
Updated again per OP comment:
sed "{:a;N;\$!b a};s/\x5c\x5c\n/\x5c\x5cn/g"
The :a creates a label and the N appends a line to the pattern space, the b a branches back to the label :a except when its the last line $!;
After its all loaded - a single line substitution replaces all occurrences of a newline \n with a literal '\n' using the hex ascii code \x5c for the backslash.

What's the difference between single and double quotes in Perl?

I am just begining to learn Perl. I looked at the beginning perl page and started working.
It says:
The difference between single quotes and double quotes is that single quotes mean that their contents should be taken literally, while double quotes mean that their contents should be interpreted
When I run this program:
#!/usr/local/bin/perl
print "This string \n shows up on two lines.";
print 'This string \n shows up on only one.';
It outputs:
This string
shows up on two lines.
This string
shows up on only one.
Am I wrong somewhere?
the version of perl below:
perl -v
This is perl, v5.8.5 built for aix
Copyright 1987-2004, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
I am inclined to say something is up with your shell/terminal, and whatever you are outputting to is interpreting the \n as a newline and that the problem is not with Perl.
To confirm: This Shouldn't Happen(TM) - in the first case I would expect to see a new line inserted, but with single quotes it ought to output literally the characters \n and not a new line.
In Perl, single-quoted strings do not expand backslash-escapes like \n or \t. The reason you're seeing them expanded is probably due to the nature of the shell that you're using, which is munging your output for some reason.
Everything you need to know about quoting and quote-like operators is in perlop.
To answer your specific question, double-quotes can turn certain sequences of literal characters into other characters. In your example, the double quotes turn the sequence of characters \ and n into the single character that represents a newline. In a single quoted string, that same literal sequence is just the literal \ and n characters.
By "interpreted", they mean that variable names and such will not be printed, but their values instead. \n is an escape sequence, so I'd think it would not be interpreted.
In addition to your O'Reilly link, a reference no less authoritative than the 'Programming Perl' book by Larry Wall, states that backslash interpolation does not occur in single quoted strings.
... much like Unix shell quotes: double quoted string literals are subject to
backslash and variable interpolation; single quoted strings are not
(except for \' and \\, so that you may ...)
Programing Perl, 2nd ed, 1996 page 16
So it would be interesting to see what your Perl does with
print 'Double backslash n: \\n';
As above, please show us the output from 'perl -v'.
And I believe I have confused the forum editor software, because that last Perl 'print' should have indented.
If you use the double quote it will be interpreted the \n as a newline.
But if you use the single quote it will not interpreted the \n as a newline.
For me it is working correctly.
file content
print "This string \n shows up on two lines.";
print 'This string \n shows up on only one.'