How to trigger Perl multiline substitution - perl

I have a folder of HTML files which have the below DOCTYPE declaration which I need to remove, so that a not-very-good parser can successfully load it as XML.
I've been trying to use perl to do the substitution in place, but no change is made when I run the substitution and I can't figure out why. Can anyone identify the correct flags or specification I need in order to remove the DOCTYPE declaration here?
Here's an example file I'd like to manipulate.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />
<title></title>
</head>
<body>
</body>
</html>
Here's the perl one-liner I'm trying to use, which looks for the angle brackets, the exclamation mark, and everything before the close angle bracket. It incorporates perl substitution flags which other postings suggest should work for a multiline match - m for multiline, s for allowing newlines to be matched by regex. I'm then replacing the match with the empty string.
perl -i -e 's/<![^>]+>//gsm' `find . -name '*.html'`
I can't figure out why, but the DOCTYPE is not removed from the file after running this command. Does anyone else know why?

What you need is the -0777 switch which will cause the entire file to be read into a single string. If this is not used, the files will be read in line-by-line mode, and you can never match a multi-line statement that way.
Also, as Andomar points out, you are missing the -p switch, but I assume you figured that out.
The modifiers on the regex won't matter in this case, except the /g modifier. /m only affects ^ and $, and /s causes wildcard . to also match newlines. None of this applies to your regex.
So basically, you want something like:
perl -0777 -pi -e 's/<![^>]+>//g' ...
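The difference between line-by-line mode and slurp mode can be reproduced inside a script. This is a minimal sketch, assuming a hypothetical helper named strip_declarations that wraps the same substitution; the sample text is a shortened version of the file from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every <!...> declaration from a string.
sub strip_declarations {
    my ($text) = @_;
    $text =~ s/<![^>]+>//g;
    return $text;
}

my $doc = qq{<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n}
        . qq{"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n}
        . qq{<html></html>\n};

# Line-by-line (what -p does by default): no single line contains both
# the opening "<!" and the closing ">", so nothing matches.
my $by_line = join '', map { strip_declarations($_) } split /^/, $doc;

# Slurp mode (what -0777 does): the whole file is one string.
my $slurped = strip_declarations($doc);

print "line-by-line changed the text: ", ($by_line eq $doc ? "no" : "yes"), "\n";
print "slurped changed the text: ",      ($slurped eq $doc ? "no" : "yes"), "\n";
```

The negated character class [^>] matches newlines on its own, which is why no /s or /m modifier is needed once the whole file is in one string.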
Side note:
Html should be handled with parsers, ideally, so I spent a few minutes working on using HTML::Parser which has a convenient option to strip declarations by adding a handler. Something like this seems to print ok for a single file:
perl -MHTML::Parser -we '
$p = HTML::Parser->new(default_h => [sub {print @_},"text"] );
$p->handler(declaration => "");
$p->parse_file(shift) or die $!; ' yourfile.html
I figured it would be overkill so I abandoned trying to fix it with the -pi in-place edit switches, but it is (probably) easily implemented in a script.

First, you seem to be missing the -p parameter, for processing the input line by line. -i doesn't seem to do much without -p.
Second, since -pi processes the input line-by-line, it can't replace a regex that spans more than one line.
You could write a Perl script instead. This script should run your regex on the entire content of all files passed on the command line:
use IO::All;
foreach my $file (@ARGV) {
    my $content = io($file)->slurp;
    $content =~ s/<![^>]+>//g;
    $content > io($file);
}
The command cpan IO::All should install the IO::All module, if it is not present on your system.
P.S. The m and s options only affect ., ^ and $. I think you can omit them.

Related

Error while running sed command in perl script

I am trying to run the following command in perl script :
#!/usr/bin/perl
my $cmd3 =`sed ':cycle s/^\(\([^,]*,\)\{0,13\}[^,|]*\)|[^,]*/\1/;t cycle' file1 >file2`;
system($cmd3);
but it produces neither output nor any error.
When I run the command from the command line, it works perfectly and gives the desired output. Can you please help me see what I am doing wrong here?
Thanks
system() doesn't return the output, just the exit status.
To see the output, print $cmd3.
my $cmd3 = `sed ':cycle s/^\(\([^,]*,\)\{0,13\}[^,|]*\)|[^,]*/\1/;t cycle' file1 >file2`;
print "$cmd3\n";
Edit:
If you want to check for exceptional return values, use CPAN module IPC::System::Simple:
use IPC::System::Simple qw(capture);
my $result = capture("any-command");
Running sed from inside Perl is just insane.
#!/usr/bin/perl
open (F, '<', "file1") or die "$0: Could not open file1: $!\n";
while (<F>) {
    1 while s/^(([^,]*,){0,13}[^,|]*)\|[^,]*/$1/;
    print;
}
Notice how Perl differs from your sed regex dialect in that grouping parentheses and alternation are unescaped, whereas a literal round parenthesis or pipe symbol needs to be backslash-escaped (or otherwise made into a literal, such as by putting it in a character class). Also, the right-hand side of the substitution prefers $1 (you will get a warning if you use warnings and have \1 in the substitution; technically, at this level, they are equivalent).
man perlrun has a snippet explaining how to implement the -i option inside a script if you really need that, but it's rather cumbersome. (Search for the first occurrence of "LINE:" which is part of the code you want.)
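However, if you want to modify file1 in-place, and you pass it to your Perl script as its sole command-line argument and read it with while (<>), you can simply say $^I = ""; (or with use English; you can say $INPLACE_EDIT = "";). The value is the backup-file extension; an empty string means no backup is kept. See man perlvar.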
By the way, the comment that your code "isn't producing any output" isn't entirely correct. It does what you are asking it to; but you are apparently asking for the wrong things.
Quoting a command in backticks executes that command. So
my $cmd3 = `sed ... file1 >file2`;
runs the sed command in a subshell, there and then, with input from file1, and redirected into file2. Because of the redirection, the output from this pipeline is nothing, i.e. an empty string "", which is assigned to $cmd3, which you then completely superfluously attempt to pass to system.
Maybe you wanted to put the sed command in regular quotes instead of backticks (so that the sed command line would be the value of $cmd3, which it then makes sense to pass to system). But because of the redirection, it would still not produce any visible output; it would create file2 containing the (possibly partially substituted) text from file1.
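The difference between backticks and system described above can be seen side by side. This is a small sketch; it uses $^X (the path of the running perl) as the child command purely so that the example does not depend on any particular external program, and it assumes that path contains no spaces.

```perl
use strict;
use warnings;

# Backticks (qx) run the command immediately and return its standard output.
my $captured = qx{$^X -le "print 6*7"};

# system() runs the command and returns the wait status, not the output;
# the child's exit code is in the high byte.
my $status    = system($^X, '-e', 'exit 3');
my $exit_code = $status >> 8;

print "captured: $captured";
print "exit code: $exit_code\n";
```

If the command inside the backticks redirects its output to a file, as in the question, the captured string is simply empty.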

removing text after last \

I've been trying to do some perl regex's and have hit the wall.
I'm trying to do some data analysis of a log file and I'm running into the following problem:
I have a file, test.csv, that is comprised of multiple single line entries from another program that produces the following layout format:
d:\snow\dir.txt
d:\snow\history\dir.tff
d:\snow\history\help.jar
d:\winter\show\help.txt
d:\summer\beach\ocean\swimming.txt
What I would like to do is delete the file names from the path listing, so the resulting file would contain:
d:\snow\
d:\snow\history\
d:\snow\history\
d:\winter\show\
d:\summer\beach\ocean\
I've banged my head against the wall on this one and have tried various perl regex's in an attempt to drop the file names out without much luck. Since the paths to the directories are of varying length, I'm hitting a wall, I'm not sure if this is something that I can do within perl or python.
You can do this with one line in Perl:
perl -pe 's/[^\\]+$/\n/' <infile.txt >outfile.txt
Taking this in pieces:
-p causes Perl to wrap the statement (supplied with -e) in a while loop, apply the statement to each line of the input file, and print the result.
-e gives Perl a statement to run against every line.
s/[^\\]+$/\n/ is a substitution statement that uses a regular expression to change any sequence of characters not including a backslash at the end of the line, to just a newline \n.
[^\\] is a regular expression that matches any single character that is not a backslash
[^\\]+ is a regular expression that matches one or more characters that are not a backslash
[^\\]+$ is a regular expression that matches one or more characters that are not a backslash followed by the end of the line
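Applied to sample paths from the question, the substitution behaves as follows. This is a minimal sketch; strip_filename is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop everything after the last backslash,
# keeping the line's newline the way the one-liner does.
sub strip_filename {
    my ($line) = @_;
    $line =~ s/[^\\]+$/\n/;
    return $line;
}

my @lines = (
    "d:\\snow\\dir.txt\n",
    "d:\\summer\\beach\\ocean\\swimming.txt\n",
);

print strip_filename($_) for @lines;
```

Because the character class also matches the trailing newline, the match swallows it together with the file name, which is why the replacement puts a \n back.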
Using regexes might work, but using a module designed for this purpose is generally speaking a better idea. File::Basename or File::Spec are suitable core modules for this purpose:
Code:
use strict;
use warnings;
use v5.10;
use File::Basename;
say dirname($_) for <DATA>;
__DATA__
d:\snow\dir.txt
d:\snow\history\dir.tff
d:\snow\history\help.jar
d:\winter\show\help.txt
d:\summer\beach\ocean\swimming.txt
Output:
d:\snow
d:\snow\history
d:\snow\history
d:\winter\show
d:\summer\beach\ocean
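Of course, if you want ending backslashes, you'll have to add them. Note that File::Basename parses paths by the rules of the operating system Perl is running on, so the output above assumes Windows; on other systems, call fileparse_set_fstype("MSWin32") first.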
And for File::Spec:
use File::Spec;
my ($volume, $dir, $file) = File::Spec->splitpath($path);
my $wanted_path = $volume . $dir; # what you want
These two modules have been part of the core distribution for a long time, which is a nice benefit.
You can also do it with this one-liner:
perl -pe 's/\w+\.\w+$//' test.csv > Output.txt
\w+\.\w+$ matches the filename with its extension at the end of the path, so only the directory part (with its trailing backslash) is left.
Here's one way to do it in Python:
python -c 'import sys,re;[sys.stdout.write(re.sub(r"[^\\]+$","\n",l))for l in sys.stdin]' < in.txt > out.txt
I'll admit it's a bit more verbose than a Perl solution.

Executing bash script, read its output and create html with Perl

I have a bash script which produces different integer values. When I run that script, the output looks like this:
12
34
34
67
6
This script runs on a Solaris server. In order to provide other users in the network with these values, I decided to write a Perl script which can:
run the bash file
read its output
build a tiny html page with a table in which the bash values are stored
That's a hard job for me because I have almost no experience with Perl. I know I can use system to execute unix commands (and bash files) but I cannot get the output. I also heard about qx, which sounds very useful for my case.
But I must admit I have no clue how to start... Could you give me a few hints on how to solve that?
With a question like this it's a little hard to know where to begin.
The qx to which you are referring is a feature of Perl. The "q*" or "Quote and Quote-like Operators" are documented in the Perl "operators" man page (normally you'd use man perlop to read that on systems with a conventional installation of Perl).
Specifically qx is the "quoted-execution of a command" ... which is essentially an alternative form of the ` (back tick or "command substitution") operator in Perl.
In other words if you execute a command like:
perl -e '$foo = qx{ls}; print "\n###\n$foo\n###\n";'
... on a system with Perl installed then it should run Perl, which should evaluate (-e) the expression you've provided (quoted). In other words we're writing a small program right on the command line. This program starts by creating a variable whose contents will be a "scalar" (which is Perl terminology for a string or number). We're assigning (the =, or assignment, operator) the output which is captured by executing the ls command back to this variable ($foo). After that we're printing the contents of our variable (whatever the ls command would have printed) with ### lines preceding and following those contents.
A quirk of Perl's qx operator (and the various other q* operators) is that it allows you to delimit the command with just about any characters you like. For example perl -e '$bar = qx/pwd/;' would capture the output of the pwd command. When you use any of the characters that are normally used as delimiters around text (parentheses, braces, brackets, etc.) then the qx command will look for the appropriate matching delimiter. If you use any other punctuation (or non-alpha-numeric character?) then that same character will be the terminating delimiter as well. This latter behavior is similar to, and was inspired by, a feature of the "substitution" command in the old sed and ed line editors, while the matching of parentheses, braces, etc. is a Perl novelty.
So that's the basics of how to capture your shell script's output. To print the numbers in an HTML table you'd have to split the captured output into separate lines (saving them into a list or array), then print your HTML prologue (the <table> and <th> (header) tags, and so on), then loop over a series of <tr> rows, interpolating your numbers into <td> (table data) containers, and then finally print your HTML epilogue (with the closing tags).
For that you'll want to read up on the Perl print function and about "interpolation" in Perl. That's a fairly complex topic.
This is all extremely crude, and there are tools around which allow you to approach the generation of HTML at a much higher level. It's also rather dubious that you want to wrap the execution of your shell script in a Perl script, since it seems likely that you could modify the shell script to output HTML directly (perhaps as an option controlled by a command line switch or environment variable) or that you could rewrite the shell script in Perl. This could potentially eliminate the extra work of parsing the output (splitting it into lines and separating the values out of those lines into an array), because you could capture the data directly into the array, or possibly print out your HTML rows, as you are generating them (however your existing shell script is doing that).
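The steps described in this answer (capture the output, split it into lines, print a prologue, loop over the rows, print an epilogue) might look like this. This is a rough sketch: table_from_output is a hypothetical helper, the column header "Value" is made up, and the hard-coded sample stands in for the script's output, which would normally come from qx or backticks.

```perl
use strict;
use warnings;

# Hypothetical helper: turn newline-separated values into an HTML table.
sub table_from_output {
    my ($output) = @_;
    my @values = grep { length } split /\n/, $output;   # one value per line
    my $html = "<table>\n<tr><th>Value</th></tr>\n";     # prologue
    $html .= "<tr><td>$_</td></tr>\n" for @values;       # one row per value
    $html .= "</table>\n";                               # epilogue
    return $html;
}

# In real use this would be the captured output of the bash script.
my $output = "12\n34\n34\n67\n6\n";
print table_from_output($output);
```

The values here are plain integers, so they are interpolated as-is; arbitrary text would need HTML escaping before being placed in a cell.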
To capture the output of your bash file, you can use the backtick operator:
use strict;
my $e = `ls`;
print $e;
Many, many thanks to you! With your great help, I was able to build a perl script which does a big part of the job.
This is what I have created so far:
#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
#some variables
my $message = "please wait, loading data...\n";
#First build the web page
print header;
print start_html('Hello World');
print "<H1>we need love, peace and harmony</H1>\n";
print "<p>$message</p>\n";
#Establish a pipeline between the bash and my script.
my $bash_command = '/love/peace/harmony/./lovepeace.bash';
open(my $pipe, '-|', $bash_command) or die $!;
while (my $line = <$pipe>){
# Do something with each line.
print "<p>$line</p>\n";
}
#job done, now refresh page?
print end_html;
When I call that .pl script in my browser, everything works nice :-) But a few questions are still on my mind:
When I call this website, it is busy loading the values from the pipe. Since there are only about 10 values, it's rather quick (2-4 seconds), but with 100+ values the user has to wait a while. Since I cannot have a progress bar, I should show the user a message like "Loading data, please wait...", and when the job is done, the message should change to "Job done" or something similar. But how do I detect that the process has finished?
Can I reload the page when the job is done?
Is there any chance of using my own stylesheet within this Perl CGI script?
Regards,
JJ
Why only Perl? You can use awk for that inside your shell script itself; I have done this before.
If you have the output values in a variable, use the method below:
echo $SUBSCRIBERS|awk 'BEGIN {
print "<?xml version=\"1.0\" encoding=\"UTF-8\"?><GenTransactionHandler xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><EntityToPublish>\n<Entity type=\"C\" typeDesc=\"Subscriber level\"><TargetApplCode>UHUNLD</TargetApplCode><TrxName>GET_SUBSCR_DATA</TrxName>"
}
{for(i=1;i<NF+1;i++) printf("<value>%d</value>\n",$i)}
END{
print "</Entity>\n</EntityToPublish></GenTransactionHandler>"}' >XML_SUB_NUM`date +%Y%m%d%H%M%S`.xml
In $SUBSCRIBERS, the values should be tab-separated.

Substituting text in a file within a Perl script

I am using webmin and I am trying to change some settings in a file. I am having problems if the person uses any weird characters that might trip up sed or Perl using the following code:
&execute_command("sed -i 's/^$Pref.*\$/$Pref \"$in{$Pref}\"/g' $DIR/pserver.prefs.cache");
Where execute_command is a webmin function to basically run a special system call. $Pref is the preference name such as "SERVERNAME", "OPTION2", etc. and $in{$Pref} is going to be the option I want set for that preference. For example here is a typical pserver.prefs:
SERVERNAME "Test Name"
OWNERPASSWORD "Hd8sdH&3"
Therefore, if we wanted to change SERVERNAME to say Tes"t#&^"#'"##& and OWNERPASSWORD to *#(&'"#$"(')29 then they would be passed in as $in{Pref}. What is the easiest way to escape the $in{} variables so that they can work OK with sed, or better yet, what is a way I can convert my sed command to a strictly Perl command so that it doesn't have errors?
Update:
Awesome, now I'm just trying to get it to work with single quotes in the value, and I get this error:
/bin/sh: -c: line 0: unexpected EOF while looking for matching `"'
/bin/sh: -c: line 1: syntax error: unexpected end of file
This does not work:
my $Pref = "&*())(*&'''''^%$#@!";
&execute_command("perl -pi -e 's/^SERVERNAME.*\$/SERVERNAME \"\Q$Pref\E\"/g' $DIR/pserver.prefs");
This does:
my $Pref = "&*())(*&^%$#@!";
&execute_command("perl -pi -e 's/^SERVERNAME.*\$/SERVERNAME \"\Q$Pref\E\"/g' $DIR/pserver.prefs");
Perl's regex support includes the \Q and \E operators, which will cause it to avoid interpreting regex symbols within their scope, yet they allow variable interpolation.
This works:
$i = '(*&%)*$£(*';
if ($i =~ /\Q$i\E/){
print "matches!\n";
}
Without the \Q and \E, you'd get an error because of the regex symbols in $i.
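To see the difference concretely, the same check can be run with and without the escaping. This is a small sketch; the sample string is the one above minus its non-ASCII character (to sidestep source-encoding questions), and the eval is only there so that the compile error can be reported instead of aborting the script.

```perl
use strict;
use warnings;

my $i = '(*&%)*$(*';

# With \Q...\E every regex metacharacter in $i is escaped, so the
# string matches itself literally.
my $quoted_ok = ($i =~ /\Q$i\E/) ? 1 : 0;

# Without the escaping, the contents of $i are compiled as a regex,
# and compilation fails.
my $raw_ok = eval { my $re = qr/$i/; 1 } ? 1 : 0;

print "with \\Q..\\E: ", ($quoted_ok ? "matches" : "no match"), "\n";
print "without: ", ($raw_ok ? "compiled" : "error: $@"), "\n";
```

The quotemeta function applies the same escaping as \Q...\E when the pattern has to be built up as a string first.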
The most trivial part is simply to stop executing a command as a single string. Get the shell out of it. Assuming your execute_command function just calls system under the covers, try:
execute_command(qw/perl -pi -e/, 's/^SERVERNAME.*$/SERVERNAME "\Q$Pref\E"/g', "$DIR/pserver.prefs");
That's better, but not perfect. After all, the user could put in something silly like "@[system qw:rm -rf /:]" and then silly things would happen. I think there are ways around this, too, but the most trivial might be to simply do the work inside your code. How to do that? Maybe starting with what perl is doing with the "-pi" flags might help. Let's take a peek:
$ perl -MO=Deparse -pi -e 's/^SERVERNAME.*$/SERVERNAME "\Qfoo\E"/'
BEGIN { $^I = ""; }
LINE: while (defined($_ = <ARGV>)) {
s/^SERVERNAME.*$/SERVERNAME "foo"/;
}
continue {
print $_;
}
Maybe you can do the same thing in your code? Not sure how easy that is to replicate, especially that $^I bit. Worst case scenario, read the file, write to a new file, delete the original file, rename the new file to the original name. That'll help get rid of all the exposures of passing dangerous junk around.
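The "worst case" approach described above (read the file, write a new file, rename it over the original) keeps the user's value away from both the shell and the regex engine. This is a minimal sketch, assuming a hypothetical helper named set_pref, a temporary file name of my own choosing, and the NAME "value" line format shown in the question; error handling is deliberately basic.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite the line for $name in $file to hold $value.
sub set_pref {
    my ($file, $name, $value) = @_;
    my $tmp = "$file.tmp";    # assumed scratch name next to the original

    open my $in,  '<', $file or die "Cannot read $file: $!";
    open my $out, '>', $tmp  or die "Cannot write $tmp: $!";
    while (my $line = <$in>) {
        # \Q...\E keeps any metacharacters in $name literal; $value is only
        # ever used as plain data, never as code or as a pattern.
        if ($line =~ /^\Q$name\E\b/) {
            $line = qq{$name "$value"\n};
        }
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot finish $tmp: $!";
    rename $tmp, $file or die "Cannot replace $file: $!";
    return;
}
```

A call might then look like set_pref("$DIR/pserver.prefs", "SERVERNAME", $in{SERVERNAME}). Since no shell and no perl -e string is involved, quotes and other special characters in the value need no escaping at all.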

Why does my perl one-liner ignore the first line of input?

I'm trying to strip some content from an HTML file automatically, and I'm using the following command to strip everything up to the useful data:
perl -pi.bak -e 'undef $/; s/^.*?<pre>//s' $file
However, for some reason this leaves the first line of the HTML file (the DOCTYPE declaration) alone.
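By the time your code runs undef $/, the -p loop has already read the first line. Use the -0 option to set $/ before anything has been read (-0 alone makes the null character the record separator, which slurps any file that contains no null bytes; -0777 is the form perlrun documents for slurping whole files).
perl -p0i.bak -e 's/^.*?<pre>//s' $file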