How to tokenize Perl source code?

I have some reasonable (not obfuscated) Perl source files, and I need a tokenizer that splits them into tokens and returns the type of each token. For example, for the script
print "Hello, World!\n";
it would return something like this:
keyword 5 bytes
whitespace 1 byte
double-quoted-string 17 bytes
semicolon 1 byte
whitespace 1 byte
Which is the best library (preferably written in Perl) for this? It has to be reasonably correct, i.e. it should be able to parse syntactic constructs like qq{{\}}}, but it doesn't have to know about special parsers like Lingua::Romana::Perligata. I know that parsing Perl is Turing-complete, and only Perl itself can do it right, but I don't need absolute correctness: the tokenizer can fail or be incompatible or assume some default in some very rare corner cases, but it should work correctly most of the time. It must be better than the syntax highlighting built into an average text editor.
FYI, I tried the PerlLexer in Pygments, which works reasonably well for most constructs, except that it cannot find the second print keyword in this one:
print length(<<"END"); print "\n";
String
END

PPI

use PPI;
Yes, only perl can parse Perl, however PPI is the 95% correct solution.
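A minimal sketch of how PPI could produce the kind of listing asked for, assuming PPI is installed from CPAN. It walks the token stream with PPI::Tokenizer and prints each token's class and byte length; the variable names are only illustrative.

```perl
use strict;
use warnings;
use PPI;    # CPAN module, not part of core Perl

# The one-line script from the question, including its trailing newline.
my $source = qq{print "Hello, World!\\n";\n};

# PPI::Tokenizer takes a reference to the source text.
my $tokenizer = PPI::Tokenizer->new(\$source)
    or die "could not create tokenizer";

my @tokens;
# get_token returns a PPI::Token object, or a false value at end of input.
while (my $token = $tokenizer->get_token) {
    push @tokens, $token;
    printf "%-28s %d bytes\n", $token->class, length($token->content);
}
```

The class names (for example PPI::Token::Word or PPI::Token::Whitespace) play the role of the token types in the question. Joining the content of all tokens gives back the original source.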

Related

Why is the perl print filehandle syntax the way it is?

I am wondering why the perl creators chose an unusual syntax for printing to a filehandle:
print filehandle list
with no comma after filehandle. I see that it's to distinguish between "print list" and "print filehandle list", but why was the ad-hoc syntax preferred over creating two functions - one to print to stdout and one to print to given filehandle?
In my searches, I came across the explanation that this is an indirect object syntax, but didn't the print function exist in perl 4 and before, whereas the object-oriented features came into perl relatively late? Is anyone familiar with the history of print in perl?

Since the comma is already used as the list constructor, you can't use it to separate semantically different arguments to print.
open my $fh, ...;
print $fh, $foo, $bar
would just look like you were trying to print the values of 3 variables. There's no way for the parser, which operates at compile time, to tell that $fh is going to refer to a file handle at run time. So you need a different character to syntactically (not semantically) distinguish between the optional file handle and the values to actually print to that file handle.
At this point, it's no more work for the parser to recognize that the first argument is separated from the second argument by blank space than it would be if it were separated by any other character.
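A small sketch of the distinction, using an in-memory filehandle as a stand-in for a real file (an assumption made only to keep the example self-contained). With a comma after $fh, print would treat the handle as just another value and write its stringified form to the currently selected handle.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for a real file (illustrative only).
my $buffer = '';
open my $fh, '>', \$buffer or die "open failed: $!";

my ($foo, $bar) = ('foo', 'bar');

# No comma after the handle: $fh is the destination, the rest is the list.
print $fh $foo, $bar;

# The block form makes the handle explicit and also accepts expressions.
print {$fh} "\n";

close $fh;
# $buffer now holds "foobar\n"
```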

If Perl had used the comma to make print look more like a function, the filehandle would always have to be included if you are including anything to print besides $_. That is the way functions work: If you pass in a second parameter, the first parameter must also be included. There isn't one function I can think of in Perl where the first parameter is optional when the second parameter exists. Take a look at split. It can be written using zero to three parameters. However, if you want to specify a <limit>, you have to specify the first two parameters too.
If you look at other languages, they all include two different ways to print: One if you want STDOUT, and another if you're printing to something besides STDOUT. Thus, Python has both print and write. C has both printf and fprintf. However, Perl can do this with just a single statement.
Let's look at the print statement a bit more closely -- thinking back to 1987 when Perl was first written.
You can think of the print syntax as really being:
print <filehandle> <list_to_print>
To print to the file behind OUTFILE, you would say:
print OUTFILE "This is being printed to myfile.txt\n";
The syntax is almost English-like (PRINT to OUTFILE the string "This is being printed to myfile.txt\n").
You can also do the same thing with STDOUT:
print STDOUT "This is being printed to your console";
print STDOUT " unless you redirected the output.\n";
As a shortcut, if the filehandle was not given, it would print to STDOUT or whatever filehandle the select was set to.
print "This is being printed to your console";
print " unless you redirected the output.\n";
select OUTFILE;
print "This is being printed to whatever the filehandle OUTFILE is pointing to\n";
Now, we see the thinking behind this syntax.
Imagine I have a program that normally prints to the console. However, my boss now wants some of that output printed to various files when required instead of STDOUT. In Perl, I could easily add a few select statements, and my problems will be solved. In Python, Java, or C, I would have to modify each of my print statements, and have some logic to choose between a file write and STDOUT (which may involve some conniptions in file opening and dupping to STDOUT).
Remember that Perl wasn't written to be a full-fledged language. It was written to do the quick and dirty job of parsing text files more easily and flexibly than awk did. Over the years, people used it because of its flexibility, and new concepts were added on top of the old ones. For example, before Perl 5, there were no such things as references, which meant there was no such thing as object-oriented programming. If we, back in the days of Perl 3 or Perl 4, needed something more complex than the simple list, hash, or scalar variable, we had to munge it ourselves. It's not like complex data structures were unheard of. C had struct since its initial beginnings. Heck, even Pascal had the concept with records back in 1969 when people thought bellbottoms were cool. (We plead insanity. We were all on drugs.) However, since neither Bourne shell nor awk had complex data structures, why would Perl need them?
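The select scenario can be sketched as follows. The report subroutine and the in-memory filehandle are hypothetical stand-ins for existing print statements and for OUTFILE.

```perl
use strict;
use warnings;

# In-memory filehandle used as a stand-in for OUTFILE (illustrative).
my $captured = '';
open my $out, '>', \$captured or die "open failed: $!";

sub report {
    # Plain print: goes to whichever handle is currently selected.
    print "report line\n";
}

my $previous = select $out;   # select returns the previously selected handle
report();                     # lands in $captured, not on the console
select $previous;             # restore the old default (normally STDOUT)
close $out;

report();                     # back on the console
```

The body of report is unchanged between the two calls; only the selected handle differs.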

Answer to "why" is probably subjective and something close to "Larry liked it".
Do note however, that indirect object notation is not a feature of print, but a general notation that can be used with any object or class and method. For example with LWP::UserAgent.
use strict;
use warnings;
use LWP::UserAgent;
my $ua = new LWP::UserAgent;
my $response = get $ua "http://www.google.com";
my $response_content = decoded_content $response;
print $response_content;
Any time you write method object, it means exactly the same as object->method. Note also that the parser seems to work reliably only as long as you don't nest such notations or use complex expressions to get the object, so unless you want to have lots of fun with brackets and quoting, I'd recommend against using it anywhere except the common cases of print, close and the rest of the IO methods.

Why not? it's concise and it works, in perl's DWIM spirit.
Most likely it's that way because Larry Wall liked it that way.

Evaluating escape sequences in perl

I'm reading strings from a file. Those strings contain escape sequences which I would like to have evaluated before processing them further. So I do:
$t = eval("\"$t\"");
which works fine. But I'm having doubts about the performance. If eval is forking a perl process each time, it will be a performance killer. Another way I considered to do the job is a regex, for which I have found related questions on SO.
My question: is there a better, more efficient way to do it?
EDIT: before calling eval in my example $t is containing \064\065\x20a\n. It is evaluated to 45 a<LF>.

It's not quite clear what the strings in the file look like and what you do to them before passing off to eval. There's something missing in the explanation.
If you simply want to undo C-style escaping (as also used in Perl), use Encode::Escape:
use Encode qw(decode);
use Encode::Escape qw();
my $string_with_unescaped_literals = decode 'ascii-escape', $string_with_escaped_literals;
If you have placeholders in the file which look like Perl variables that you want to fill with values, then you are abusing eval as a poor man's templating engine. Use a real one that does not have the dangerous side effect of running arbitrary code.

$string =~ s/\\([rnt'"\\])/"qq|\\$1|"/gee
A string eval can solve the problem too, but it brings up a host of security and maintenance issues, such as a # in the string.

Oh gah, don't use eval for this, that's dangerous if someone provides it with input like "system('sync;reboot');".
But, you could do something like this:
#!/usr/bin/perl
$string = 'foo\"ba\\\'r';
printf("%s\n", $string);
$string =~ s/\\([\"\'])/$1/g;
printf("%s\n", $string);
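For the escapes shown in the question (octal, hex and a few named ones), a table-driven substitution avoids eval entirely. This is a sketch under assumptions: the unescape helper and the set of supported escapes are illustrative choices, not a complete implementation of Perl's string syntax.

```perl
use strict;
use warnings;

# Named escapes to support; extend as needed (this set is an assumption).
my %named = (
    'n'  => "\n", 't' => "\t", 'r' => "\r",
    '\\' => '\\', '"' => '"',  "'" => "'",
);

sub unescape {
    my ($text) = @_;
    $text =~ s{
        \\                          # a literal backslash, then one of:
        (?: ([0-7]{1,3})            #   1-3 octal digits       (capture 1)
          | x([0-9A-Fa-f]{1,2})     #   x plus 1-2 hex digits  (capture 2)
          | (.)                     #   any other character    (capture 3)
        )
    }{
        defined($1)         ? chr(oct($1))
      : defined($2)         ? chr(hex($2))
      : exists $named{$3}   ? $named{$3}
      :                       "\\$3"      # unknown escape: leave untouched
    }gexs;
    return $text;
}

# The sample from the question: prints "45 a" followed by a newline.
print unescape('\064\065\x20a\n');
```

No code from the input is ever executed, so a string such as system('sync;reboot') passes through as plain text.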

How can I convert CGI input to UTF-8 without Perl's Encode module?

Through this forum, I have learned that it is not a good idea to use the following for converting CGI input (from either an escape()d Ajax call or a normal HTML form post) to UTF-8:
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
utf8::decode $_;
A safer way (which for example does not allow bogus characters through) is to do the following:
use Encode qw (decode);
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
$_ = decode ('UTF-8', $_, Encode::FB_CROAK);
I would, however, very much like to avoid using any modules (including XSLoader, Exporter, and whatever else they bring with them). The function is for a high-volume mod_perl driven website and I think both performance and maintainability will be better without modules (especially since the current code does not use any).
I guess one approach would be to examine the Encode module and strip out the functions and constants used for the “decode ('UTF-8', $_, Encode::FB_CROAK)” call. I am not sufficiently familiar with Unicode and Perl modules to do this. Maybe somebody else is capable of doing this or knows a similar, safe “native” way of doing the UTF-8 conversion?
UPDATE:
I prefer keeping things non-modular, because then the only black-box is Perl's own compiler (unless of course you dig down into the module libs).
Sometimes you see large modules being replaced with a few specific lines of code. For example, instead of the CGI.pm module (which people are also in love with), one can use the following for parsing AJAX posts:
my %Input;
if ($ENV{CONTENT_LENGTH}) {
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    foreach (split (/&/)) {
        tr/+/ /; s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
        if (m{^(\w+)=\s*(.*?)\s*$}s) { $Input{$1} = $2; }
        else { die ("bad input ($_)"); }
    }
}
In a similar way, it would be great if one could extract or replicate Encode's UTF-8 decode function.

Don't pre-optimize. Do it the conventional way first, then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and handcuffed doesn't give you any benefit.
Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.

Don't use escape() to create your posted data. This isn't compatible with URL-encoding, it's a mutant JavaScript oddity which should normally never be used. One of the defects is that it will encode non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.
You should typically use encodeURIComponent() instead.
If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).
To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).
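The regex this answer pointed to did not survive in this copy. As a stand-in, here is a sketch of a widely circulated byte-level pattern for well-formed UTF-8; it is not necessarily the one originally linked, and is_valid_utf8 is a hypothetical helper name.

```perl
use strict;
use warnings;

# Returns 1 if $bytes is well-formed UTF-8 made only of tab, LF, CR,
# printable ASCII and multi-byte sequences; 0 otherwise.
# Operates on raw bytes, i.e. on the input before any decoding.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
          [\x09\x0A\x0D\x20-\x7E]             # ASCII without most controls
        | [\xC2-\xDF][\x80-\xBF]              # 2-byte, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3-byte, straight
        | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4-byte, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}           # 4-byte, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4-byte, plane 16
    )*\z/x ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";
```

A check like this would run after the %-sequences have been replaced and before the data is treated as text.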

In Perl, can I treat a string as a byte array?

In Perl, is it appropriate to use a string as a byte array containing 8-bit data? All the documentation I can find on this subject focuses on 7-bit strings.
For instance, if I read some data from a binary file into $data
my $data;
open FILE, "<", $filepath;
binmode FILE;
read FILE, $data, 1024;
and I want to get the first byte out, is substr($data,0,1) appropriate? (again, assuming it is 8-bit data)
I come from a mostly C background, and I am used to passing a char pointer to a read() function. My problem might be that I don't understand what the underlying representation of a string is in Perl.

The bundled documentation for the read command, reproduced here, provides a lot of information that is relevant to your question.
read FILEHANDLE,SCALAR,LENGTH,OFFSET
read FILEHANDLE,SCALAR,LENGTH
Attempts to read LENGTH characters of data into variable SCALAR
from the specified FILEHANDLE. Returns the number of
characters actually read, 0 at end of file, or undef if there
was an error (in the latter case $! is also set). SCALAR will
be grown or shrunk so that the last character actually read is
the last character of the scalar after the read.
An OFFSET may be specified to place the read data at some place
in the string other than the beginning. A negative OFFSET
specifies placement at that many characters counting backwards
from the end of the string. A positive OFFSET greater than the
length of SCALAR results in the string being padded to the
required size with "\0" bytes before the result of the read is
appended.
The call is actually implemented in terms of either Perl's or
system's fread() call. To get a true read(2) system call, see
"sysread".
Note the characters: depending on the status of the filehandle,
either (8-bit) bytes or characters are read. By default all
filehandles operate on bytes, but for example if the filehandle
has been opened with the ":utf8" I/O layer (see "open", and the
"open" pragma, open), the I/O will operate on UTF-8 encoded
Unicode characters, not bytes. Similarly for the ":encoding"
pragma: in that case pretty much any characters can be read.

See perldoc -f pack and perldoc -f unpack for how to treat strings as byte arrays.
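A short sketch of what that looks like in practice; the sample data is made up for illustration.

```perl
use strict;
use warnings;

my $data = "\x89PNG\x0d\x0a";       # sample binary data (illustrative)

# 'C*' unpacks every byte into an unsigned integer 0..255.
my @bytes = unpack 'C*', $data;     # (0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a)

# pack is the inverse: a list of byte values back into a string.
my $copy = pack 'C*', @bytes;

# 'H*' gives a hex dump of the whole string.
my $hex = unpack 'H*', $data;       # "89504e470d0a"

print "$hex\n";
```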

You probably want to use sysopen and sysread if you want to read bytes from binary file.
See also perlopentut.
Whether this is appropriate or necessary depends on what exactly you are trying to do.
#!/usr/bin/perl -l
use strict; use warnings;
use autodie;
use Fcntl;
sysopen my $bin, 'test.png', O_RDONLY;
sysread $bin, my $header, 4;
print map { sprintf '%02x', ord($_) } split //, $header;
Output:
C:\Temp> t
89504e47

Strings are strings of "characters", which are bigger than a byte. [1] You can store bytes in them and manipulate them as though they are characters, taking substrs of them and so on, and so long as you're just manipulating entities in memory, everything is pretty peachy. The data storage is weird, but that's mostly not your problem. [2]
When you try to read and write from files, the fact that your characters might not map to bytes becomes important and interesting. Not to mention annoying. This annoyance is actually made a bit worse by Perl trying to do what you want in the common case: If all the characters in the string fit into a byte and you happen to be on a non-Windows OS, you don't actually have to do anything special to read and write bytes. Perl will complain, however, if you have stored a non-byte-sized character and try to write it without giving it a clue about what to do with it.
This is getting a little far afield, largely because encoding is a large and confusing topic. Let me leave it off there with some references: Look at Encode(3perl), open(3perl), perldoc open, and perldoc binmode for lots of hilarious and gory details.
So the summary answer is "Yes, you can treat strings as though they contained bytes if they do in fact contain bytes, which you can assure by only reading and writing bytes.".
[1]: Or pedantically, "which can express a larger range of values than a byte, though they are stored as bytes when that is convenient". I think.
[2]: For the record, strings in Perl are internally represented by a data structure called a 'PV' which in addition to a character pointer knows things like the length of the string and the current value of pos. [3]
[3]: Well, it will start storing the current value of pos if it starts being interesting. See also
use Devel::Peek;
my $x = "bluh bluh bluh bluh";
Dump($x);
$x =~ /bluh/mg;
Dump($x);
$x =~ /bluh/mg;
Dump($x);

It might help more if you tell us what you are trying to do with the byte array. There are various ways to work with binary data, and each lends itself to a different set of tools.
Do you want to convert the data into a Perl array? If so, pack and unpack are a good start. split could also come in handy.
Do you want to access individual elements of the string without unpacking it? If so, substr is fast and will do the trick for 8-bit data. If you want other bit depths, take a look at the vec function, which treats a string as a bit vector.
Do you want to scan the string and convert certain bytes to other bytes? Then the s/// or tr/// constructs might be useful.
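The in-place options can be sketched like this, on made-up sample data:

```perl
use strict;
use warnings;

my $data = "\x00\x00\x84\x00";

# substr offsets are zero-based: the first byte is at offset 0.
my $third = substr($data, 2, 1);           # "\x84"
my $value = ord($third);                   # 132

# vec with a width of 8 reads or writes one byte as a number.
my $same = vec($data, 2, 8);               # 132
vec($data, 3, 8) = 0xFF;                   # $data is now "\x00\x00\x84\xFF"

# tr/// maps bytes to other bytes; here NUL bytes become spaces in a copy.
(my $mapped = $data) =~ tr/\x00/\x20/;

print unpack('H*', $mapped), "\n";
```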

Allow me just to post a small example about treating string as binary array - since I myself found it difficult to believe that something called "substr" would handle null bytes; but seemingly it does - below is a snippet of a perl debugger terminal session (with both string and array/list approaches):
$ perl -d
Loading DB routines from perl5db.pl version 1.32
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
^D
Debugged program terminated. Use q to quit or R to restart,
use o inhibit_exit to avoid stopping after program termination,
h q, h R or h o to get additional info.
DB<1> $str="\x00\x00\x84\x00"
DB<2> print $str
�
DB<3> print unpack("H*",$str) # show content of $str as hex via `unpack`
00008400
DB<4> $str2=substr($str,2,2)
DB<5> print unpack("H*",$str2)
8400
DB<6> $str2=substr($str,1,3)
DB<7> print unpack("H*",$str2)
008400
[...]
DB<30> @stra=split('',$str); print @stra # convert string to array (by splitting at empty string)
�
DB<31> print unpack("H*",$stra[3]) # print indiv. elems. of array as hex
00
DB<32> print unpack("H*",$stra[2])
84
DB<33> print unpack("H*",$stra[1])
00
DB<34> print unpack("H*",$stra[0])
00
DB<35> print unpack("H*",join('',@stra[1..3])) # print only portion of array/list via indexes (using range [two dots] operator)
008400

Perl: Encoding messed up after text concatenation

I have encountered a weird situation while updating/upgrading some legacy code.
I have a variable which contains HTML. Before I can output it, it has to be filled with lots of data. In essence, I have the following:
for my $line (@lines) {
    $output = loadstuff($line, $output);
}
Inside of loadstuff(), there is the following
sub loadstuff {
    my ($line, $output) = @_;
    # here the process is simplified for better understanding.
    my $stuff = getOtherStuff($line);
    my $result = $output.$stuff;
    return $result;
}
This function builds a page which consists of different areas. Each area is loaded independently, which is why there is a for-loop.
Trouble starts right about here. When I load the page from ground up (click on a link, Perl executes and delivers HTML), everything is loaded fine. Whenever I load a second page via AJAX for comparison, that HTML has broken encoding.
I tracked down the problem to this line my $result = $output.$stuff. Before the concatenation, $output and $stuff are fine. But afterward, the encoding in $result is messed up.
Does somebody have a clue why concatenation messes up my encoding? While we are on the subject, why does it only happen when the call is done via AJAX?
Edit 1
The Perl and the AJAX call both execute the very same functions for building up a page. So, whenever I fix it for AJAX, it is broken for freshly reloaded pages. It really seems to happen only if AJAX starts the call.
The only difference in this particular case is that the current values for the page are compared with an older one (it is a backup/restore function). From here, everything is the same. The encoding in the variables (as far as I can tell) is OK. I even tried the Encode functions only on the values loaded from AJAX, but to no avail. The files themselves seem to be utf8 according to "Kate".
Besides that, I have another function with the same behavior which uses the EXACT same functions, values and files. When the call is started from Perl/Apache, the encoding is OK. Via AJAX, again, it is messed up.
I have been examining the AJAX request (jQuery) and could not find anything odd. The encoding seems to be utf8 too.

Perl has a “utf8” flag for every scalar value, which may be “on” or “off”. “On” state of the flag tells perl to treat the value as a string of Unicode characters.
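If you take a string with the utf8 flag off and concatenate it with a string that has the utf8 flag on, perl upgrades the first one to its internal Unicode representation, treating each byte as a Latin-1 character. If those bytes were actually UTF-8 encoded data, every multi-byte character turns into several wrong characters. This is the usual source of problems.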
You need to either convert both variables to bytes with Encode::encode() or to perl's internal format with Encode::decode() before concatenation.
See perldoc Encode.
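A minimal sketch of the effect and of the fix, using made-up sample strings. The byte string stands for data that was never decoded, the character string for data that was.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "caf\xC3\xA9";                      # UTF-8 encoded bytes (5 bytes)
my $chars = decode('UTF-8', "na\xC3\xAFve");    # decoded text (5 characters)

# Mixing the two: the bytes are upgraded as if they were Latin-1,
# so the two bytes of the e-acute become two separate characters.
my $broken = $chars . $bytes;

# Decoding first keeps everything in characters.
my $fixed = $chars . decode('UTF-8', $bytes);

# Encode once, on the way out.
my $output = encode('UTF-8', $fixed);

printf "broken: %d characters, fixed: %d characters\n",
    length($broken), length($fixed);
```

The broken string is one character longer than the fixed one, which is the kind of difference that shows up as garbled accents in the browser.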

Expanding on the previous answer, here's a little more information that I found useful when I started messing with character encodings in Perl.
This is an excellent introduction to Unicode in perl: http://perldoc.perl.org/perluniintro.html. The section "Perl's Unicode Model" is particularly relevant to the issue you're seeing.
A good rule to use in Perl is to decode data to Perl characters on its way in and encode it into bytes on its way out. You can do this explicitly using Encode::encode and Encode::decode. If you're reading from/writing to a file handle you can specify an encoding on the filehandle by using binmode and setting a layer: perldoc -f binmode
You can tell which of the strings in your example has been decoded into Perl characters using Encode::is_utf8:
use Encode qw( is_utf8 );
print is_utf8($stuff) ? 'characters' : 'bytes';
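The decode-in, encode-out rule can also be left to I/O layers. The sketch below uses in-memory filehandles in place of real files, which is an assumption made only to keep it self-contained; with real files the layer goes in the same place in the open call.

```perl
use strict;
use warnings;

my $input_bytes = "caf\xC3\xA9\n";   # UTF-8 bytes standing in for a file

# Decode on the way in: the layer turns bytes into characters.
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;                          # 4 characters now, not 5 bytes

# Encode on the way out: the layer turns characters back into bytes.
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open failed: $!";
print {$out} uc($line), "\n";
close $out;
```

Because $line holds characters, uc upper-cases the accented letter as well, and the output layer encodes the result as UTF-8 again.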

A colleague of mine found the answer to this problem. It really had something to do with the fact that AJAX started the call.
The file structure is as follows:
1 Handler, accessed by Apache
1 Handler, accessed by Apache, which only contains AJAX responders. We call it the AJAX-Handler
1 package, which contains functions relevant for the entire software and which accesses yet other packages from our own Framework
Inside of the AJAX-Handler, we print the result as such
sub handler {
    my $r = shift;
    # processing output
    $r->print($output);
    return Apache2::Const::OK;
}
Now, when I replace $r->print($output); by print($output);, the problem disappears! I know that this is not the recommended way to print stuff in mod_perl, but this seems to work.
Still, any ideas how to do this the proper way are welcome.
