Lexing/Parsing "here" documents - perl

For those that are experts in lexing and parsing... I am attempting to write a series of programs in perl that would parse out IBM mainframe z/OS JCL for a variety of purposes, but am hitting a roadblock in methodology. I am mostly following the lexing/parsing ideology put forth in "Higher Order Perl" by Mark Jason Dominus, but there are some things that I can't quite figure out how to do.
JCL has what's called inline data, which is very similar to "here" documents. I am not quite sure how to lex these into tokens.
The layout for inline data is as follows:
//DDNAME DD *
this is the inline data
this is some more inline data
/*
...
Conventionally, the "*" after the "DD" signifies that following lines are the inline data itself, terminated by either "/*" or the next valid JCL record (starting with "//" in the first 2 columns).
More advanced, the inline data could appear as such:
//DDNAME DD *,DLM=ZZ
//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA
//MORE DATA MASQUERADING AS JCL
ZZ
...
Sometimes the inline data is itself JCL (perhaps to be pumped to a program or the internal reader, whatever).
But here's the rub. In JCL, the records are 80 bytes, fixed in length. Everything past column 72 (cols 73-80) is a "comment". As well, everything following a blank that follows valid JCL is likewise a comment. Since I am looking to manipulate JCL in my programs and spit it back out, I'd like to capture comments so that I can preserve them.
So, here's an example of inline comments in the case of inline data:
//DDNAME DD *,DLM=ZZ THIS IS A COMMENT COL73DAT
data
...
ZZ
...more JCL
I originally thought that I could have my top-most lexer pull in a line of JCL and immediately create a non-token for cols 1-72 and then a token (['COL73COMMENT',$1]) for the column 73 comment, if any. This would then pass downstream to the next iterator/tokenizer a string of the cols 1-72 text followed by the col73 token.
But how would I, downstream from there, grab the inline data? I'd originally figured that the top-most tokenizer could look for a "DD \*(,DLM=(\S*))" (or the like) and then just keep pulling records from the feeding iterator until it hit the delimiter or a valid JCL starter ("//").
But you may see the issue here... I can't have 2 topmost tokenizers... either the tokenizer that looks for COL73 comments must be the top or the tokenizer that gets inline data must be at the top.
I imagine that perl parsers have the same challenge, since seeing
<<DELIM
isn't necessarily the end of the line, followed by the here document data. After all, you could see perl like:
my $this=$obj->ingest(<<DELIM)->reformat();
inline here document data
more data
DELIM
How would the tokenizer/parser know to tokenize the ")->reformat();" and then still grab the following records as-is? In the case of the inline JCL data, those lines are passed as-is, cols 73-80 are NOT comments in that case...
So, any takers on this? I know there will be tons of questions clarifying my needs and I'm happy to clarify as much as is needed.
Thanks in advance for any help...

In this answer I will concentrate on heredocs, because the lessons can be easily transferred to the JCL.
Any language that supports heredocs is not context-free, and thus cannot be parsed with common techniques like recursive descent. We need a way to guide the lexer along more twisted paths, but in doing so, we can maintain the appearance of a context-free language. All we need is another stack.
For the parser, we treat introductions to heredocs <<END as string literals. But the lexer has to be extended to do the following:
When a heredoc introduction is encountered, it adds the terminator to the stack.
When a newline is encountered, the body of the heredoc is lexed, until the stack is empty. After that, normal parsing is resumed.
Take care to update the line number appropriately.
In a hand-written combined parser/lexer, this could be implemented like so:
use strict; use warnings; use 5.010;
my $s = <<'INPUT-END'; pos($s) = 0;
<<A <<B
body 1
A
body 2
B
<<C
body 3
C
INPUT-END
my #strs;
push #strs, parse_line() while pos($s) < length($s);
for my $i (0 .. $#strs) {
say "STRING $i:";
say $strs[$i];
}
sub parse_line {
my #strings;
my #heredocs;
$s =~ /\G\s+/gc;
# get the markers
while ($s =~ /\G<<(\w+)/gc) {
push #strings, '';
push #heredocs, [ \$strings[-1], $1 ];
$s =~ /\G[^\S\n]+/gc; # spaces that are no newlines
}
# lex the EOL
$s =~ /\G\n/gc or die "Newline expected";
# process the deferred heredocs:
while (my $heredoc = shift #heredocs) {
my ($placeholder, $marker) = #$heredoc;
$s =~ /\G(.*\n)$marker\n/sgc or die "Heredoc <<$marker expected";
$$placeholder = $1;
}
return #strings;
}
Output:
STRING 0:
body 1
STRING 1:
body 2
STRING 2:
body 3
The Marpa parser simplifies this a bit by allowing events to be triggered once a certain token is parsed. These are called pauses, because the built-in lexing pauses a moment for you to take over. Here is a high-level overview and a short blogpost describing this technique with the demo code on Github.

In case anyone was wondering how I decided to resolve this, here is what I did.
My main lexing routine accepts an iterator that pumps full lines of text (which can take it from a file, a string, whatever I want). The routine uses that to create another iterator, which examines the line for "comments" after column 72, which it will then return as a "mainline" token followed by a "col72" token. This iterator is then used to create yet another iterator, which passes the col72 tokens through unchanged, but takes the mainline tokens and lexes them into atomic tokens (things like STRING, NUMBER, COMMA, NEWLINE, etc).
But here's the crux... the lexing routine has the ORIGINAL ITERATOR still... so when it receives a token that indicates there is a "here" document, it continues processing tokens until it hits a NEWLINE token (meaning end of the actual line of text) and then uses the original iterator to pull off the here document data. Since that iterator feeds the atomic tokens iterator, pulling from it then prevents those lines from being atomized.
To illustrate, think of iterators like hoses. The first hose is the main iterator. To that I attach the col72 iterator hose, and to that I attach the atomic tokenizer hose. As streams of characters go in the first hose, atomized tokens come out the end of the third hose. But I can attach a 2-way nozzle to the first hose that will allow its output to come out the alternate nozzle, preventing that data from going into the second hose (and hence the third hose). When I'm done diverting the data through the alternate nozzle, I can turn that off and then data begins flowing through the second and third hoses again.
Easy-peasey.

Related

Using Perl to parse text from blocks

I have a file with multiple blocks of test. FOR EACH block of test, I want to be able to extract what is in the square bracket, the line containing the FIRST instance of the word "area", and what is on the right of the square bracket. Everything will be a string. Essentially what I want to do is store each string into a variable in a hash so i can print it into a 3 column csv file.
Here's a sample of what the file looks like:
Student-[K-6] Exceptional in Math
/home/area/kinder/mathadvance.txt, 12
Students in grade K-12 shown to be exceptional in math.
Placed into special after school program.
See /home/area/overall/performance.txt, 200
Student-[Junior] Weak Performance
Students with overall weak performance.
Summer program services offered as shown in
"/home/area/services/summer.txt", 212
Student-[K-6] Physical Excerise Time Slots
/home/area/pe/schedule.txt, 303
Assigned time slots for PE based on student's grade level. Make reference to
/home/area/overall/classtimes.txt, 90
I want to to have a final csv file that looks like:
Grade,Topic,Path
K-6, Exceptional in Math, /home/area/kinder/mathadvance.txt, 12
K-6, Physical Exercise Time Slots, /home/area/pe/schedule.txt, 303
Junior, Weak Performance, "/home/area/services/summer.txt", 212
Since it's a csv file, I know it will also separate at the line number when exporting into excel but I'm fine with that.
I started off by putting the grade type into an array because I want to be able to add more strings to it for different grade levels.
My program looks like this so far:
#!/usr/bin/perl
use strict;
use warnings;
my #grades = ("K-6", "Junior", "Community-College", "PreK");
I was thinking that I will need to do some sort of system sed command to grab what is in the brackets and store it into a variable. Then I will grab everything to the right of the bracket on the line and store it into a variable. And then I will grep for a line containing "area" to get the path and I will store it as a string into a variable, put these in a hash, and then print into csv. I'm not sure if I'm thinking about this the right way. Also, I have NO IDEA how to do this for each BLOCK of text in the file. I need it by block because each block has its own corresponding grades, topics, and paths.
perl -000 -ne '($grade, $topic) = /\[(.*)\] (.*)/;
($path) = m{(.*/area/.*)};
print "$grade, $topic, $path\n"' -- file.txt
-000 turns on paragraph mode, -n won't read line by line, but paragraph by paragraph
/\[(.*)\] (.*)/ matches the square brackets and whatever follows them up to a newline. The inside of the square brackets and the following text are captured using the parentheses.
m{(.*/area/.*)} captures the line containing "area". It uses the m{} syntax instead of // so we don't have to backslash the slashes (avoiding so called "leaning toothpick syndrome")

How do I write a perl6 macro to enquote text?

I'm looking to create a macro in P6 which converts its argument to a string.
Here's my macro:
macro tfilter($expr) {
quasi {
my $str = Q ({{{$expr}}});
filter-sub $str;
};
}
And here is how I call it:
my #some = tfilter(age < 50);
However, when I run the program, I obtain the error:
Unable to parse expression in quote words; couldn't find final '>'
How do I fix this?
Your use case, converting some code to a string via a macro, is very reasonable. There isn't an established API for this yet (even in my head), although I have come across and thought about the same use case. It would be nice in cases such as:
assert a ** 2 + b ** 2 == c ** 2;
This assert statement macro could evaluate its expression, and if it fails, it could print it out. Printing it out requires stringifying it. (In fact, in this case, having file-and-line information would be a nice touch also.)
(Edit: 007 is a language laboratory to flesh out macros in Perl 6.)
Right now in 007 if you stringify a Q object (an AST), you get a condensed object representation of the AST itself, not the code it represents:
$ bin/007 -e='say(~quasi { 2 + 2 })'
Q::Infix::Addition {
identifier: Q::Identifier "infix:+",
lhs: Q::Literal::Int 2,
rhs: Q::Literal::Int 2
}
This is potentially more meaningful and immediate than outputting source code. Consider also the fact that it's possible to build ASTs that were never source code in the first place. (And people are expected to do this. And to mix such "synthetic Qtrees" with natural ones from programs.)
So maybe what we're looking at is a property on Q nodes called .source or something. Then we'd be able to do this:
$ bin/007 -e='say((quasi { 2 + 2 }).source)'
2 + 2
(Note: doesn't work yet.)
It's an interesting question what .source ought to output for synthetic Qtrees. Should it throw an exception? Or just output <black box source>? Or do a best-effort attempt to turn itself into stringified source?
Coming back to your original code, this line fascinates me:
my $str = Q ({{{$expr}}});
It's actually a really cogent attempt to express what you want to do (turn an AST into its string representation). But I doubt it'll ever work as-is. In the end, it's still kind of based on a source-code-as-strings kind of thinking à la C. The fundamental issue with it is that the place where you put your {{{$expr}}} (inside of a string quote environment) is not a place where an expression AST is able to go. From an AST node type perspective, it doesn't typecheck because expressions are not a subtype of quote environments.
Hope that helps!
(PS: Taking a step back, I think you're doing yourself a disservice by making filter-sub accept a string argument. What will you do with the string inside of this function? Parse it for information? In that case you'd be better off analyzing the AST, not the string.)
(PPS: Moritz++ on #perl6 points out that there's an unrelated syntax error in age < 50 that needs to be addressed. Perl 6 is picky about things being defined before they are used; macros do not change this equation much. Therefore, the Perl 6 parser is going to assume that age is a function you haven't declared yet. Then it's going to consider the < an opening quote character. Eventually it'll be disappointed that there's no >. Again, macros don't rescue you from needing to declare your variables up-front. (Though see #159 for further discussion.))

How does this Perl one-liner actually work?

So, I happened to notice that last.fm is hiring in my area, and since I've known a few people who worked there, I though of applying.
But I thought I'd better take a look at the current staff first.
Everyone on that page has a cute/clever/dumb strapline, like "Is life not a thousand times too short for us to bore ourselves?". In fact, it was quite amusing, until I got to this:
perl -e'print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34'
Which I couldn't resist pasting into my terminal (kind of a stupid thing to do, maybe), but it printed:
Just another Last.fm hacker,
I thought it would be relatively easy to figure out how that Perl one-liner works. But I couldn't really make sense of the documentation, and I don't know Perl, so I wasn't even sure I was reading the relevant documentation.
So I tried modifying the numbers, which got me nowhere. So I decided it was genuinely interesting and worth figuring out.
So, 'how does it work' being a bit vague, my question is mainly,
What are those numbers? Why are there negative numbers and positive numbers, and does the negativity or positivity matter?
What does the combination of operators +=$_ do?
What's pack+q,c*,, doing?
This is a variant on “Just another Perl hacker”, a Perl meme. As JAPHs go, this one is relatively tame.
The first thing you need to do is figure out how to parse the perl program. It lacks parentheses around function calls and uses the + and quote-like operators in interesting ways. The original program is this:
print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34
pack is a function, whereas print and map are list operators. Either way, a function or non-nullary operator name immediately followed by a plus sign can't be using + as a binary operator, so both + signs at the beginning are unary operators. This oddity is described in the manual.
If we add parentheses, use the block syntax for map, and add a bit of whitespace, we get:
print(+pack(+q,c*,,
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
The next tricky bit is that q here is the q quote-like operator. It's more commonly written with single quotes:
print(+pack(+'c*',
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
Remember that the unary plus is a no-op (apart from forcing a scalar context), so things should now be looking more familiar. This is a call to the pack function, with a format of c*, meaning “any number of characters, specified by their number in the current character set”. An alternate way to write this is
print(join("", map {chr($.+=$_)} (74, …, -34)))
The map function applies the supplied block to the elements of the argument list in order. For each element, $_ is set to the element value, and the result of the map call is the list of values returned by executing the block on the successive elements. A longer way to write this program would be
#list_accumulator = ();
for $n in (74, …, -34) {
$. += $n;
push #list_accumulator, chr($.)
}
print(join("", #list_accumulator))
The $. variable contains a running total of the numbers. The numbers are chosen so that the running total is the ASCII codes of the characters the author wants to print: 74=J, 74+43=117=u, 74+43-2=115=s, etc. They are negative or positive depending on whether each character is before or after the previous one in ASCII order.
For your next task, explain this JAPH (produced by EyesDrop).
''=~('(?{'.('-)#.)#_*([]#!#/)(#)#-#),#(##+#)'
^'][)#]`}`]()`#.#]#%[`}%[#`#!##%[').',"})')
Don't use any of this in production code.
The basic idea behind this is quite simple. You have an array containing the ASCII values of the characters. To make things a little bit more complicated you don't use absolute values, but relative ones except for the first one. So the idea is to add the specific value to the previous one, for example:
74 -> J
74 + 43 -> u
74 + 42 + (-2 ) -> s
Even though $. is a special variable in Perl it does not mean anything special in this case. It is just used to save the previous value and add the current element:
map($.+=$_, ARRAY)
Basically it means add the current list element ($_) to the variable $.. This will return a new array with the correct ASCII values for the new sentence.
The q function in Perl is used for single quoted, literal strings. E.g. you can use something like
q/Literal $1 String/
q!Another literal String!
q,Third literal string,
This means that pack+q,c*,, is basically pack 'c*', ARRAY. The c* modifier in pack interprets the value as characters. For example, it will use the value and interpret it as a character.
It basically boils down to this:
#!/usr/bin/perl
use strict;
use warnings;
my $prev_value = 0;
my #relative = (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34);
my #absolute = map($prev_value += $_, #relative);
print pack("c*", #absolute);

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute relative amounts of 'G' and 'C' within each sequence, and then I'm to write it to a TAB-delimited file the names of the genes, and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is.. but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimite" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As quick google or stackoverflow search would tell you..
Here is an approach using 'awk' utility which can be used from the command line. The following program is executed by specifying its path and using awk -f <path> <sequence file>
#NR>1 means only look at lines above 1 because you said the sequence starts on line 2
NR>1{
#this for-loop goes through all bases in the line and then performs operations below:
for (i=1;i<=length;i++)
#for each position encountered, the variable "total" is increased by 1 for total bases
total++
}
{
for (i=1;i<=length;i++)
#if the "substring" i.e. position in a line == c or g upper or lower (some bases are
#lowercase in some fasta files), it will carry out the following instructions:
if(substr($0,i,1)=="c" || substr($0,i,1)=="C")
#this increments the c count by one for every c or C encountered, the next if statement does
#the same thing for g and G:
c++; else
if(substr($0,i,1)=="g" || substr($0,i,1)=="G")
g++
}
END{
#this "END-block" prints the gene name and C, G content in percentage, separated by tabs
print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}

Reading custom values in Ebay RSS feed (XML::RSS module)

I've spent entirely way too long trying to figure this out. I'm using XML: RSS and Perl to read / parse an Ebay RSS feed. Within the <item></item> area, I see these entries:
<rx:BuyItNowPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1395</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1255</rx:CurrentPrice>
However, I can't figure out how to grab the details during the loop. I wrote a regex to grab them:
#current_price = $item =~ m/\<rx\:CurrentPrice.*\>(\d+)\<\/rx\:CurrentPrice\>/g;
Which works if you place the above 'CurrentPrice' entry into a standalone string, but not while the script is reading through the RSS feed.
I can grab most of the information I want out of the item->description area (# bids, auction end time, BIN price, thumbnail image, etc.), but it would be nicer if I could grab the info from the feed without me having to deal with grabbing all that information manually.
How to grab custom fields from an RSS feed (short of writing regexes to parse the entire feed w/o a module)?
Here's the code I'm working with:
$my_limit = 0;
use LWP::Simple;
use XML::RSS;
$rss = XML::RSS->new();
$data = get( $mylink );
$rss->parse( $data );
$channel = $rss->{channel};
$NumItems = 0;
foreach $item (#{$rss->{'items'}}) {
if($NumItems > $my_limit){
last;
}
#current_price = $item =~ m/\<rx\:CurrentPrice.*\>(\d+)\<\/rx\:CurrentPrice\>/g;
print "$current_price[0]";
}
If you have the rss/xml document and want specific data you could use XPATH:
Perl CPAN XPATH
XPath Introduction
What is the way in which "it doesn't work" from an RSS feed? Do you mean no matches when there should be matches? Or one match where there should be several matches?
One thing that jumps out at me about your regular expression is that you use .*, which can sometimes be greedier than you want. That is, if $item contained the expression
<rx:BuyItNowPrice xmlns:rx="urn:...nts">1395</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:...nts">1255</rx:CurrentPrice>
<rx:BuyItNowPrice xmlns:rx="urn:...nts">1395</rx:BuyItNowPrice>
<rx:SomeMoreStuff xmlns:rx="urn:...nts">zzz</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:...nts">1255</rx:CurrentPrice>
then the first part of your regular expression (\<rx\:CurrentPrice.*\>) will wind up matching everything on lines 2, 3, and 4, plus the first part of line 5 (up to the >). Instead, you might want to use the regular expression1
m/\<rx:CurrentPrice[^>]*>(\d+)\<\/rx:CurrentPrice\>/
which will only match up to the closing </rx:CurrentPrice> tag after a single instance of an opening <rx:CurrentPrice> tag.
1 The other obvious answer is that you really don't want to use a regular expression at all, that regular expressions are inferior tools for parsing XML compared to customized parsing modules, and that all the special cases you will have to deal with using regular expressions will eventually render you unconscious from having repeatedly beaten your head against your desk. See Salgar's answer, for example.