How can I extract all quotations in a text? - perl

I'm looking for a SimpleGrepSedPerlOrPythonOneLiner that outputs all quotations in a text.
Example 1:
echo '"HAL," noted Frank, "said that everything was going extremely well."' | SimpleGrepSedPerlOrPythonOneLiner
stdout:
"HAL,"
"said that everything was going extremely well."
Example 2:
cat MicrosoftWindowsXPEula.txt | SimpleGrepSedPerlOrPythonOneLiner
stdout:
"EULA"
"Software"
"Workstation Computer"
"Device"
"DRM"
etc.
(link to the corresponding text).

I like this:
perl -ne 'print "$_\n" foreach /"((?>[^"\\]|\\+[^"]|\\(?:\\\\)*")*)"/g;'
It's a little verbose, but it handles escaped quotes and backtracking a lot better than the simplest implementation. What it's saying is:
my $re = qr{
    "                   # Begin it with literal quote
    (
        (?>             # prevent backtracking once the alternation has been
                        # satisfied. It either agrees or it does not. This expression
                        # only needs one direction, or we fail out of the branch
            [^"\\]      # a character that is not a dquote or a backslash
        |   \\+         # OR if a backslash, then any number of backslashes followed by
            [^"]        # something that is not a quote
        |   \\          # OR again a backslash
            (?>\\\\)*   # followed by any number of *pairs* of backslashes (as units)
            "           # and a quote
        )*              # any number of *set* qualifying phrases
    )                   # all batched up together
    "                   # Ended by a literal quote
}x;
If you don't need that much power--say it's only likely to be dialog and not structured quotes, then
/"([^"]*)"/
probably works about as well as anything else.

No regexp solution will work if you have nested quotes, but for your examples this works well
$ echo \"HAL,\" noted Frank, \"said that everything was going extremely well\" \
    | perl -n -e 'while (m/(".*?")/g) { print $1."\n"; }'
"HAL,"
"said that everything was going extremely well"
$ cat eula.txt| perl -n -e 'while (m/(".*?")/g) { print $1."\n"; }'
"EULA"
"online"
"Software"
"Workstation Computer"
"Device"
"multiplexing"
"DRM"
"Secure Content"
"DRM Software"
"Secure Content Owners"
"DRM Upgrades"
"WMFSDK"
"Not For Resale"
"NFR,"
"Academic Edition"
"AE,"
"Qualified Educational User."
"Exclusion of Incidental, Consequential and Certain Other Damages"
"Restricted Rights"
"Exclusion des dommages accessoires, indirects et de certains autres dommages"
"Consumer rights"

grep -o "\"[^\"]*\""
This greps for " + anything except a quote, any number of times + "
The -o makes it only output the matched text, not the whole line.
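A quick way to check this against the first example from the question. This is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep do, but POSIX does not require it). The function name extract_quotes is made up for illustration.

```shell
# extract_quotes is a hypothetical helper name, not part of grep.
# It prints every "..." span found on standard input, one per line.
extract_quotes() {
    grep -o '"[^"]*"'
}

printf '%s\n' '"HAL," noted Frank, "said that everything was going extremely well."' \
    | extract_quotes
```

Like the simple Perl pattern, this does not handle backslash-escaped quotes inside a quotation.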

grep -o '"[^"]*"' file
The -o option prints only the part of each line that matches the pattern.

Related

How to get the gateway IP with Perl?

I need to know the gateway of the localhost
I tried a system command and the IP routing table, but got nothing.
system("ipconfig | findstr /i "Gateway"")
I expected the output to be the gateway, but I got:
Bareword found where operator expected at script.pl line 63, near ""ipconfig | findstr /i "Gateway"
(Missing operator before Gateway?)
String found where operator expected at script.pl line 63, near "Gateway"""
syntax error at script.pl line 63, near ""ipconfig | findstr /i "Gateway"
Execution of script.pl aborted due to compilation errors.
Intro
findstr is unnecessary, as Perl is a wonderful grep engine.
Under Linux, I would do:
my $gw;
open my $ipr, "ip r|" or die "Cannot run ip: $!";
while (<$ipr>) {
    $gw = $1 if /default.*via ([0-9.]+) /;
}
print $gw."\n";
As your question is about ipconfig, I think something like this would do:
open my $ipr, "ipconfig /all|" or die "Cannot run ipconfig: $!";
while (<$ipr>) {
    $gw = $1 if /[dD].*faul?t.*: ([0-9.]+) *$/;
}
print $gw."\n";
Note: the regex is an attempt based on fr.wikipedia and en.wikipedia. Feedback welcome!
Grouped
my $gw;
my $regex = 'default.*via ([0-9.]+) ';
my $cmd = 'ip r';
if ($^O =~ "MSWin") {
    $regex = '[dD].*faul?t.*: ([0-9.]+) *$';
    $cmd = 'ipconfig /all';
}
open my $ipr, $cmd."|" or die "Cannot run $cmd: $!";
while (<$ipr>) {
    $gw = $1 if /$regex/;
}
print $gw."\n";
This works on my Debian Linux. I have no idea whether it works under MSWin. Feedback welcome!
Or by using traceroute:
use Net::Traceroute;
$tr = Net::Traceroute->new(host => "8.8.8.8",max_ttl=>1);
print "Gateway: " . $tr->hop_query_host(1,0) . "\n";
I see no-one has actually explained what your problem is.
You cannot use plain double-quote characters inside a double-quoted string. If you think about it, it should be obvious that the first double-quote character inside a double-quoted string will be seen as the end of the string.
Your code is like this:
system("ipconfig | findstr /i "Gateway"")
This is seen as a double-quoted string ("ipconfig | findstr /i") followed by a bareword (Gateway) and another double-quoted string (an empty string - ""). This is never going to compile successfully.
The easiest fix is to change your double-quoted string to a single-quoted string:
system('ipconfig | findstr /i "Gateway"')
But, as others have pointed out, it seems like a very strange idea to use findstr when you have the whole power of Perl available.

Perl regex directly escaping special characters

A Perl beginner here. I have been working on some simple one-liners to find and replace text in a file. I read about escaping all special characters with \Q\E or quotemeta(), but found this only works when interpolating a variable. For example, when I try to replace the part containing special characters directly, it fails. But when I store it in a scalar first, it works. Of course, if I escape all the special characters with backslashes, it also works.
$ echo 'One$~^Three' | perl -pe 's/\Q$~^\E/Two/'
One$~^Three
$ echo 'One$~^Three' | perl -pe '$Sub=q($~^); s/\Q$Sub\E/Two/'
OneTwoThree
$ echo 'One$~^Three' | perl -pe 's/\$\~\^/Two/'
OneTwoThree
Can anyone explain this behavior and also show if any alternative exists that can directly quote special characters without using backslashes?
Interpolation happens first, then \Q, \U, \u, \L and \l.
That means
"abc\Qdef$ghi!jkl\Emno"
is equivalent to
"abc" . quotemeta("def" . $ghi . "!jkl") . "mno"
So,
s/\Q$~^/Two/ # not ok quotemeta($~ . "^")
s/\Q$Sub/Two/ # ok
s/\$\~\^/Two/ # ok
s/\$\Q~^/Two/ # ok

why does changing from ' to " affect the behavior of this one-liner?

Why does simply enclosing my one-liner in " instead of ' affect the behavior of the code? The first command produces what is expected, and the second gives (to me!) an unexpected result, printing an array reference.
$ echo "puke|1|2|3|puke2" | perl -lne 'chomp;@a=split(/\|/,$_);print $a[4];'
puke2
$ echo "puke|1|2|3|puke2" | perl -lne "chomp;@a=split(/\|/,$_);print $a[4];"
ARRAY(0x1f79b98)
This is the Perl version:
$ perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-thread-multi
With double quotes you are letting the shell interpolate variables first.
As you can check, $_ and $a are unset in the subshell forked for pipe by the parent shell. See a comment on $_ below.
So the double-quoted version is effectively
echo "puke|1|2|3|puke2" | perl -lne 'chomp;@a=split(/\|/,);print [4];'
which prints the array reference [4].
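The expansion can be seen without running Perl at all, by looking at the string the shell builds in each case. A minimal sketch, assuming bash or a POSIX sh in which the shell variable a is unset. The names single and double are only for illustration.

```shell
# Make sure the shell variable "a" is unset, as it is in the question's pipeline.
unset a

# What the shell would hand to perl in each case:
single='print $a[4];'   # single quotes: passed through untouched
double="print $a[4];"   # double quotes: $a expands to nothing, [4] is left over

printf '%s\n' "$single" "$double"
```

The second line of output is print [4]; which is why Perl ends up printing a reference to the anonymous array [4].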
A comment on the effects of having $_ exposed to Bash. Thanks to Borodin for bringing this up.
The $_ is one of a handful of special shell parameters in Bash. It contains the last argument of the previous command, or the pathname of what invoked the shell or commands (via _ environment variable). See the link for a full description.
However, here it is being interpreted in a subshell forked to run the perl command, which is the first command that subshell runs. Apparently it is not even set, as seen with
echo hi; echo hi | echo $_
which prints an empty line (after first hi). The reason may be that the _ environment variable just isn't set for a subshell for a pipe, but I don't see why this would be the case. For example,
echo hi; (echo $_)
prints two lines with hi even though ( ) starts a subshell.
In any case, $_ in the given pipeline isn't set.
The split part is then split(/\|/), so via default split(/\|/, $_) -- with nothing to split. With -w added this indeed prints a warning for use of uninitialized $_.
Note that this behavior depends on the shell. The tcsh won't run this with double quotes at all. In ksh and zsh the last part of pipeline runs in the main shell, not a subshell, so $_ is there.
This is actually a shell topic, not a Perl topic.
In shell:
Single quotes preserve the literal value of all of the characters they contain, including the $ and backslash. However, with double quotes, the $, backtick, and backslash characters have special meaning.
For example:
'\"' evaluates to \"
whereas
"\"" evaluates to just "
because with double quotes, the backslash gets a special meaning as the escape character. It only acts as an escape before $, backtick, ", backslash, or a newline, so "\'" still evaluates to \'.
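The difference is easy to check directly. A minimal sketch for bash or any POSIX sh; the variable names are made up for the demonstration.

```shell
# Single quotes: everything is literal, including the backslash.
single='\"'

# Double quotes: the backslash escapes the double quote that follows it.
double="\""

# Double quotes: $ still triggers expansion unless it is backslash-escaped.
name=world              # hypothetical variable for the demo
expanded="hello $name"
literal="hello \$name"

printf '%s\n' "$single" "$double" "$expanded" "$literal"
```

This prints \" then " then hello world then hello $name, one per line.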

Linux shell: change Perl code to linux shell, grep line by line

The following code is a Perl script. It greps the lines containing 'Stage ' from hostlog, then matches each line against a regex and, on a match, increments the count by 1:
$command = 'grep \'Stage \' '. $hostlog;
@stage_info = qx($command);
foreach (@stage_info) {
    if ( /Stage\s(\d+)\s(.*)/ ) {
        $stage_number = $stage_number+1;
    }
}
So how can I do this in a Linux shell? In my tests I could not loop line by line, because the lines contain spaces.
That is a horrible piece of Perl code you've got there. Here's why:
1. It looks like you are not using use strict; use warnings;. That is a huge mistake: leaving them out does not prevent errors, it just hides them.
2. Using qx() to grep lines from a file is completely redundant, as this is what Perl itself does best. "Shelling out" to a process like that usually slows your program down.
3. Use some whitespace to make your code readable. This is hard to read, and looks more complicated than it is.
4. You capture strings by using parentheses in your regex, but you never use these strings.
5. Re: $stage_number=$stage_number+1, see point 3. This can also be written $stage_number++. Using the ++ operator makes your code clearer, prevents the uninitialized-value warning, and saves you some typing.
Here is what your code should look like:
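use strict;
use warnings;

my $hostlog = shift @ARGV;   # assumption: the log path is passed as the first argument
my $stage_number = 0;

open my $fh, "<", $hostlog or die "Cannot open $hostlog for reading: $!";
while (<$fh>) {
    if (/Stage\s\d+/) {
        $stage_number++;
    }
}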
You're not doing anything with the internal captures, so why bother? You could do everything with a grep:
$ stage_number=$(grep -E 'Stage[[:space:]][0-9]+[[:space:]]' "$hostlog" | wc -l)
This uses extended regular expressions (-E). grep -E has no \d, and \s is a GNU extension, so the bracket expressions [0-9] and [[:space:]] are used here instead. I am not sure how well the Solaris egrep command handles this regular expression.
If there's something more you have to do, you've got to explain it in your question.
If I understand the issue correctly, you should be able to do this just fine in the shell:
while IFS= read -r; do
    if echo "$REPLY" | grep -q 'Stage '; then
        : # Do what you need to do
    fi
done < test.log
Note that if your grep command supports the -P option you may be able to use the Perl regular expression as-is for the second test.
This is almost it. Plain bash patterns (without extglob) cannot express "one or more digits", so this only checks for a single digit after Stage.
#!/bin/bash
command=( grep 'Stage ' "$hostlog" )
while read line
do
    [ "$line" != "${line/Stage [0-9]/}" ] && (( ++stage_number ))
done < <( "${command[@]}" )
On the other hand, taking into account what the Perl script is for rather than the operations it performs, the whole thing could be rewritten as
(( stage_number += ` grep -c 'Stage [0-9]\+\s' "$hostlog" ` ))
or this
stage_number=` grep -c 'Stage [0-9]\+\s' "$hostlog" `
if, in the original Perl, stage_number is uninitialised or is initialised to 0.
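A minimal sketch of the grep -c approach on a made-up log. It writes digits as [0-9], since plain grep patterns have no \d. The temporary file and its contents are assumptions for the example, not taken from the question.

```shell
# Build a small throwaway log; the contents are invented for this example.
hostlog=$(mktemp)
printf '%s\n' 'Stage 1 started' 'unrelated line' \
              'Stage 2 finished' 'Stage x skipped' > "$hostlog"

# Count the lines containing "Stage ", one or more digits, then whitespace.
stage_number=$(grep -c 'Stage [0-9][0-9]*[[:space:]]' "$hostlog")
printf '%s\n' "$stage_number"   # prints 2

rm -f "$hostlog"
```

grep -c counts matching lines, not individual matches, which is the same thing the Perl loop counts.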

How to quote a perl $symbol in a makefile

In a Makefile, I have a rule to make a figure list from a LaTeX paper by piping the
output from a script to a perl expression that increments figure numbers $f++ and prepends Figure $f: to the lines.
From a command line, it works fine, as follows:
% texdepend -format=1 -print=f MilestonesProject | perl -pe 'unless (/^#/){$f++; s/^/Figure $f: /}' > FIGLIST
generating FIGLIST:
# texdepend, v0.96 (Michael Friendly (friendly@yorku.ca))
# commandline: texdepend -format=1 -print=f MilestonesProject
# FIGS =
Figure 1: fig/langren-google-overlay2.pdf
Figure 2: fig/mileyears4.png
Figure 3: fig/datavis-schema-3.pdf
Figure 4: fig/datavis-timeline2.png
...
I can't figure out how to make this work in a Makefile, because the $f stuff in the perl expression gets interpreted by make and I can't figure out how to quote it or otherwise
make it invisible to make.
My most recent attempt in my Makefile:
## Generate FIGLIST; doesnt work due to Make quoting
FIGLIST:
	$(TEXDEPEND) -format=1 -print=f $(MAIN) | perl -pe 'unless (/^#/){\$f++; s/^/Figure \$f: /}' > FIGLIST
Can someone help?
-Michael
Double the dollar signs: make turns $$ into a single $ before it passes the recipe to the shell.
## Generate FIGLIST
FIGLIST:
	$(TEXDEPEND) -format=1 -print=f $(MAIN) \
	| perl -pe 'unless (/^\#/){$$f++; s/^/Figure $$f: /}' > $@
You may need to backslash-escape the comment sign as well. I did so just in case.
See also http://www.gnu.org/software/make/manual/html_node/Variables-in-Recipes.html#Variables-in-Recipes