How can I make fscanf re-read a line upon a condition being met?

while( fscanf( tracefile, "%s ", opcode ) != EOF ){blah}
Occasionally I need to make fscanf re-read a line when a certain condition in my code is met. Is this possible, and how would I do it?
Thanks.

I almost never use fscanf directly since it's a pain to know where the file pointer is left on an error condition.
I use fgets to pull in a single line, then I can use sscanf to my heart's content without having to go back to the file to re-read.
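For illustration, a minimal sketch of that pattern (the buffer sizes and the re-parse comment are my own placeholders, not from the question; tracefile is the open FILE* from the question):

#include <stdio.h>

char line[256];
char opcode[64];
while (fgets(line, sizeof line, tracefile) != NULL) {
    if (sscanf(line, "%63s", opcode) != 1)
        continue;                 /* blank or unparsable line */
    /* ... work with opcode ... */
    /* to "re-read" the line, just call sscanf(line, ...) again --
       the text is still sitting in memory */
}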

Assuming your input file is seekable (and not, for example, a pipe or network stream) you could do something like:
fpos_t position_before;
fgetpos(tracefile, &position_before);
fscanf(tracefile, "%s ", opcode);
if (need_to_rescan) { fsetpos(tracefile, &position_before); }
Backing up and rescanning can be pretty inefficient (on top of not working for input from a pipe and the like), so you might want to consider whether there's an alternative.

Related

Getting Error of Modification of a read-only value attempted

I am trying to select the below value from a database:
Reporting that one of #its many problems had been the recent# extended
sales slump in women's apparel, the seven-store retailer said it would
start a three-month liquidation sale in all of its stores.~(A) its
many problems had been the recent~(B) its many problems has been the
recently~(C) its many problems is the recently~(D) their many problems
is the recent~(E) their many problems had been the recent~
I am selecting this value into the variable $ques and then extracting the text between the # markers as below:
$ques=~s/^(.*?)\#(.*?)\#(.*?)$/$2/;
Now, while replacing the ~ character in the string by
$3=~s/~/\n/g; ---->line 171
and running the script, I am getting one error as:
Modification of a read-only value attempted at main.pl line 171
I want to replace all the ~ characters with '\n' and print the final value. Please suggest how to do it.
I have researched this on the net, but got confused about how to handle these read-only variables.
You've already got a good explanation of the problem from José Castro. But there's another solution if you're using a recent-ish version of Perl (Update: having checked more carefully, I find that means 5.14+). The /r argument to the substitution operator will copy your string, make the substitution on the copy and then return that altered value.
So you could write:
my $new_value = $3 =~ s/~/\n/rg;
It sounds like what you really want in this case is split rather than regular expression capture groups:
my @parts = split(/#/, $ques);
$parts[2] =~ s/~/\n/g;
It makes the intent of your code clearer since you are, in fact, splitting on # symbols.
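For instance, with a throwaway string shaped like the poster's data (my own example, not the original value):

my $ques  = 'lead-in #captured part# tail~(A) one~(B) two~';
my @parts = split(/#/, $ques);   # ('lead-in ', 'captured part', ' tail~(A) one~(B) two~')
$parts[2] =~ s/~/\n/g;           # the newlines end up only in the tail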
Just like you say, the special variables $1, $2, etc., are read-only, and that means that you can't perform that substitution on them.
Performing the substitution on $ques will do what you need:
$ques =~ s/~/\n/g;
print $ques;
Do note that in the earlier substitution that you're performing on $ques you're getting rid of all the ~ characters.

perl memory usage when processing a file inline

I have a CGI script that's used by our employees to fetch logs from servers that they don't have direct access to. For reasons I won't go into, after a recent update to our app some of these logs now have characters like linefeeds, tabs, backslashes, etc. translated into their text equivalents. As such, I've modified the CGI script to invoke the following to convert these back to their original values:
perl -i -pe 's/\\r/\r/g && s/\\n/\n/g && s/\\t/\t/g && s/\\\//\//g' $filename
I was just informed that some people are now getting out of memory errors when they try to fetch logs that are fairly large (a few hundred MB).
My question: How does perl manage memory when an inline command like this is invoked? Is it reading the whole file in, processing it, then writing it out, or is it creating a temporary file, processing the lines from the input file one at a time then replacing the file once complete?
This is using perl 5.10.1 on a 64-bit Amazon linux instance.
The -p switch creates a while(<>){...; print} loop to iterate on each “line” in your input file.
If all of your newlines have been converted into "\\n", then your file would just be a single very long line. Therefore, your command would be loading the entire file into memory to perform your fix.
To avoid that, you'll have to intentionally buffer the file using either sysread or $/.
It would probably be easiest to create an actual script instead of a one-liner to do the work. However, if you know that all of your newlines are converted, then one simple fix would be to use $/ = "\\n"
As a secondary note, your regex chain is flawed. You're currently joining your s/// translations with the && shortcut operator, so if any one of the earlier substitutions doesn't match on a particular line, none of the later translations will be attempted. You should instead use simple semicolons to separate your regexes:
's/\\r/\r/g; s/\\n/\n/g; s/\\t/\t/g; s|\\/|/|g'
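Putting the $/ fix and the semicolon fix together, the one-liner might look like this (an untested sketch that assumes every original newline really was turned into a literal \n):

perl -i -pe 'BEGIN { $/ = "\\n" } s/\\r/\r/g; s/\\n/\n/g; s/\\t/\t/g; s|\\/|/|g' $filename

With $/ set to the two-character sequence \n, Perl reads the file in chunks ending at each escaped newline instead of slurping it as one giant line, so memory use stays bounded by the chunk length.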

Lexing/Parsing "here" documents

For those that are experts in lexing and parsing... I am attempting to write a series of programs in perl that would parse out IBM mainframe z/OS JCL for a variety of purposes, but am hitting a roadblock in methodology. I am mostly following the lexing/parsing ideology put forth in "Higher Order Perl" by Mark Jason Dominus, but there are some things that I can't quite figure out how to do.
JCL has what's called inline data, which is very similar to "here" documents. I am not quite sure how to lex these into tokens.
The layout for inline data is as follows:
//DDNAME DD *
this is the inline data
this is some more inline data
/*
...
Conventionally, the "*" after the "DD" signifies that following lines are the inline data itself, terminated by either "/*" or the next valid JCL record (starting with "//" in the first 2 columns).
More advanced, the inline data could appear as such:
//DDNAME DD *,DLM=ZZ
//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA
//MORE DATA MASQUERADING AS JCL
ZZ
...
Sometimes the inline data is itself JCL (perhaps to be pumped to a program or the internal reader, whatever).
But here's the rub. In JCL, the records are 80 bytes, fixed in length. Everything past column 72 (cols 73-80) is a "comment". As well, everything following a blank that follows valid JCL is likewise a comment. Since I am looking to manipulate JCL in my programs and spit it back out, I'd like to capture comments so that I can preserve them.
So, here's an example of inline comments in the case of inline data:
//DDNAME DD *,DLM=ZZ THIS IS A COMMENT COL73DAT
data
...
ZZ
...more JCL
I originally thought that I could have my top-most lexer pull in a line of JCL and immediately create a non-token for cols 1-72 and then a token (['COL73COMMENT',$1]) for the column 73 comment, if any. This would then pass downstream to the next iterator/tokenizer a string of the cols 1-72 text followed by the col73 token.
But how would I, downstream from there, grab the inline data? I'd originally figured that the top-most tokenizer could look for a "DD \*(,DLM=(\S*))" (or the like) and then just keep pulling records from the feeding iterator until it hit the delimiter or a valid JCL starter ("//").
But you may see the issue here... I can't have 2 topmost tokenizers... either the tokenizer that looks for COL73 comments must be the top or the tokenizer that gets inline data must be at the top.
I imagine that perl parsers have the same challenge, since seeing
<<DELIM
isn't necessarily the end of the line, followed by the here document data. After all, you could see perl like:
my $this=$obj->ingest(<<DELIM)->reformat();
inline here document data
more data
DELIM
How would the tokenizer/parser know to tokenize the ")->reformat();" and then still grab the following records as-is? In the case of the inline JCL data, those lines are passed as-is, cols 73-80 are NOT comments in that case...
So, any takers on this? I know there will be tons of questions clarifying my needs and I'm happy to clarify as much as is needed.
Thanks in advance for any help...
In this answer I will concentrate on heredocs, because the lessons can be easily transferred to the JCL.
Any language that supports heredocs is not context-free, and thus cannot be parsed with common techniques like recursive descent. We need a way to guide the lexer along more twisted paths, but in doing so, we can maintain the appearance of a context-free language. All we need is another stack.
For the parser, we treat introductions to heredocs <<END as string literals. But the lexer has to be extended to do the following:
When a heredoc introduction is encountered, it adds the terminator to the stack.
When a newline is encountered, the body of the heredoc is lexed, until the stack is empty. After that, normal parsing is resumed.
Take care to update the line number appropriately.
In a hand-written combined parser/lexer, this could be implemented like so:
use strict; use warnings; use 5.010;

my $s = <<'INPUT-END'; pos($s) = 0;
<<A <<B
body 1
A
body 2
B
<<C
body 3
C
INPUT-END

my @strs;
push @strs, parse_line() while pos($s) < length($s);

for my $i (0 .. $#strs) {
    say "STRING $i:";
    say $strs[$i];
}

sub parse_line {
    my @strings;
    my @heredocs;

    $s =~ /\G\s+/gc;

    # get the markers
    while ($s =~ /\G<<(\w+)/gc) {
        push @strings, '';
        push @heredocs, [ \$strings[-1], $1 ];
        $s =~ /\G[^\S\n]+/gc;  # spaces that are no newlines
    }

    # lex the EOL
    $s =~ /\G\n/gc or die "Newline expected";

    # process the deferred heredocs:
    while (my $heredoc = shift @heredocs) {
        my ($placeholder, $marker) = @$heredoc;
        $s =~ /\G(.*\n)$marker\n/sgc or die "Heredoc <<$marker expected";
        $$placeholder = $1;
    }

    return @strings;
}
Output:
STRING 0:
body 1
STRING 1:
body 2
STRING 2:
body 3
The Marpa parser simplifies this a bit by allowing events to be triggered once a certain token is parsed. These are called pauses, because the built-in lexing pauses a moment for you to take over. Here is a high-level overview and a short blogpost describing this technique with the demo code on Github.
In case anyone was wondering how I decided to resolve this, here is what I did.
My main lexing routine accepts an iterator that pumps full lines of text (which can take it from a file, a string, whatever I want). The routine uses that to create another iterator, which examines the line for "comments" after column 72, which it will then return as a "mainline" token followed by a "col72" token. This iterator is then used to create yet another iterator, which passes the col72 tokens through unchanged, but takes the mainline tokens and lexes them into atomic tokens (things like STRING, NUMBER, COMMA, NEWLINE, etc).
But here's the crux... the lexing routine has the ORIGINAL ITERATOR still... so when it receives a token that indicates there is a "here" document, it continues processing tokens until it hits a NEWLINE token (meaning end of the actual line of text) and then uses the original iterator to pull off the here document data. Since that iterator feeds the atomic tokens iterator, pulling from it then prevents those lines from being atomized.
To illustrate, think of iterators like hoses. The first hose is the main iterator. To that I attach the col72 iterator hose, and to that I attach the atomic tokenizer hose. As streams of characters go in the first hose, atomized tokens come out the end of the third hose. But I can attach a 2-way nozzle to the first hose that will allow its output to come out the alternate nozzle, preventing that data from going into the second hose (and hence the third hose). When I'm done diverting the data through the alternate nozzle, I can turn that off and then data begins flowing through the second and third hoses again.
Easy-peasey.
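For anyone curious, here's a stripped-down sketch of the hoses in code, using HOP-style closure iterators (all the names are invented for the example; the real lexer does considerably more):

use strict; use warnings;

sub make_line_iter {                 # the first hose: raw records from a filehandle
    my ($fh) = @_;
    return sub { scalar <$fh> };
}

sub make_col72_iter {                # the second hose: split off col-73 comments
    my ($raw) = @_;
    my @queue;
    return sub {
        return shift @queue if @queue;
        my $line = $raw->();
        return unless defined $line;
        chomp $line;
        push @queue, [ COL73COMMENT => substr($line, 72) ] if length($line) > 72;
        return [ MAINLINE => substr($line, 0, 72) ];
    };
}

# The lexer hangs on to $raw; to divert the stream it pulls verbatim
# records straight from $raw until it sees the delimiter, so nothing
# downstream ever gets a chance to atomize them.
sub slurp_inline_data {
    my ($raw, $delim) = @_;
    my @records;
    while (defined(my $line = $raw->())) {
        last if $line =~ /^\Q$delim\E/;
        push @records, $line;
    }
    return \@records;
}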

How to access buffer contents of Expect module in perl

I am using Expect to automate terminal-based applications, and I send data depending on the result of the expect command. I know that expect, while doing string matching, stores all the unmatched string patterns in a buffer. For example, $expect_out(0,string) is used to store the string that expect is actually waiting for, while $expect_out(buffer) contains all the unmatched text that accumulated up to the previous command.
I want to know if there is any way of accessing these expect buffers, like copying expect buffer contents into some variable as shown below
$mybuffer = $expect_out(buffer);
but the above statement is actually throwing an error "syntax error at perl_app_hh.pl line 72, near "$expect_out(""
I just want to copy contents of expect buffer to a variable. So please help me on this issue.
You're going to have to read the documentation for the Expect module. $expect_out(buffer) is not valid Perl.
my $exp = Expect->spawn(...);
$exp->send(...);
$exp->expect($timeout, '-re', $pattern);   # match something first
my $buffer = $exp->before();               # text that arrived before the match
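A self-contained example might look like this (the spawned program and the pattern are made up for the demo):

use strict; use warnings;
use Expect;

my $exp = Expect->spawn('bc') or die "Cannot spawn bc: $!";
$exp->send("2+2\n");
$exp->expect(5, '-re', '\d+');     # wait up to 5 seconds for a digit
my $before  = $exp->before();      # everything that arrived before the match
my $matched = $exp->match();       # the matched text itself
print "before=[$before] matched=[$matched]\n";
$exp->soft_close();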

Perl: Problem with changing encoding in the middle of reading a file

I am using Perl to load some 'macro' files. These macros can, however, be encoded in various encodings, so there is a directive defined for users writing their macros (e.g.
#encoding iso-8859-2
at the beginning of the macro).
Every time this directive is encountered in the macro, a function that sets the encoding is called; it looks something like this:
sub change_encoding {
    my ($file_handle, $encoding) = @_;
    $file_handle->flush();
    binmode($file_handle);                         # get rid of IO layers
    binmode($file_handle, ":encoding($encoding)");
}
The problem is that when I read the macro using standard
while ($line = <$file_handle>) {
    process_macro($line);
}
I get messages saying "utf8 "\xXY" does not map to Unicode", but only if characters with diacritics are near the #encoding directive. I tried several examples and was able to get half of the string as \xXY codes and the other half as correctly decoded characters, like here:
sub macro5_fn {
print "\xBElu\xBBou\xE8k\xFD k\xF9\xF2 úpěl ďábelské ódy\n";
}
If I put more comments before the function, all the characters are OK:
sub macro5_fn {
print "žluťoučký kůň úpěl ďábelské ódy\n";
}
Simply put, the number of correctly decoded characters depends on the distance of those characters from the #encoding directive; the ones that are close are not decoded correctly.
It seems to me that this is an issue of Perl and PerlIO (not) flushing the buffer. Or am I doing something wrong?
Thank you for your answers.
The problem is that <> reads more than just one line, so the next line or so is being interpreted under the old encoding before you ever see the #encoding directive for the new.
Your best bet is probably to read the file in binary mode and use the Encode module to decode each line from the current encoding.
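A minimal sketch of that approach (the default encoding and the helper name read_macro_file are assumptions for the example; process_macro comes from the question):

use strict; use warnings;
use Encode qw(decode);

sub read_macro_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";  # no PerlIO decoding
    my $encoding = 'iso-8859-1';   # assumed default until a directive shows up
    while (my $line = <$fh>) {
        if ($line =~ /^#encoding\s+(\S+)/) {
            $encoding = $1;        # takes effect on the very next line read
            next;
        }
        process_macro(decode($encoding, $line));
    }
    close $fh;
}

Because the handle stays in raw mode, the read-ahead buffering that caused the original problem never interprets any bytes; each line is decoded explicitly only after the current value of the directive is known.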