Perl lazy match in an external file with a regular expression

I have an index.html file that contains a string like:
vendor.adsf34.bundle.js blah blah inline.1r34afer.bundle.js
I've built the following code (mac, iterm2):
perl -i -pe 's/vendor\..+?\.js/vendor.js/g;' index.html
perl -i -pe 's/inline\..+?\.js/inline.js/g;' index.html
However I end up with:
vendor.js
i.e., it seems to be matching greedily where I need it to be lazy.
What I'm trying to do is shorten the JS file names, e.g. as follows:
vendor.js blah blah inline.js
It would be great to get some pointers!

Something like this, maybe?
look for a fixed start string (vendor or inline) and capture it as $1
look for a literal dot followed by one or more non-space characters
look for a fixed end string (.js)
replace the whole match with $1 + .js
$ echo "vendor.adsf34.bundle.js blah blah inline.1r34afer.bundle.js" | perl -pe 's/(vendor|inline)\.\S+\.js/$1.js/g'
vendor.js blah blah inline.js

Related

Perl code to delete a multi line XML node

I have an xml file test.xml
<many-nested-roots>
<foo>
<bar>
</bar>
</foo>
<other-random-nodes></other-random-nodes>
<foo>
<bar>
<foobar>
</foobar>
</bar>
</foo>
<!-- multiple such blocks not in any particular order -->
</many-nested-roots>
I need to delete xml node <foo><bar></bar></foo> but not <foo><bar><foobar></foobar></bar></foo>.
EDIT: The node <foo><bar></bar></foo> occurs multiple times and randomly across a heavily nested XML.
What I tried which doesn't work:
perl -ne 'print unless /^\s*<foo>\n\s*<bar>\n\s*<\bar>\n\s*<\/foo>/' test.xml
^ This doesn't match for newline
perl -ne 'print unless /<foo>/ ... /<\/foo>/' test.xml
^ This deletes all the tags including <foobar>
perl -ne 'print unless /<foo>.*?<bar>.*?<\/bar>.*?<\/foo>/s' test.xml
^ I used /s to let . match for newline. Doesn't work.
A one-liner using XML::LibXML and an XPath expression to find the nodes to delete:
perl -MXML::LibXML -E '
my $dom = XML::LibXML->load_xml(location => $ARGV[0]);
$_->unbindNode for $dom->documentElement->find("//foo/bar[count(*)=0]/..")->@*;
print $dom->serialize' test.xml
(Perl versions before 5.24 need @{$dom->...} instead of $dom->...->@*)
Or using xmlstarlet (not perl, but very handy for scripted manipulation of XML files):
xmlstarlet ed -d '//foo/bar[count(*)=0]/..' test.xml
As @Shawn and @tshiono said, you should not use a regex but an XML parser. Here is an example, though not a one-liner, using Mojo::DOM from Mojolicious:
#!/usr/bin/env perl
use Mojo::Base -strict, -signatures;
use Mojo::DOM;
use Mojo::File 'path';
# Parse the file named on the command line in XML mode.
my $dom = Mojo::DOM->new->xml(1)->parse(path($ARGV[0])->slurp);
# Remove the parent of every <bar> under a <foo> that has no child elements.
$dom->find("foo bar")->each(
    sub ($el, $i) { $el->parent->remove if $el->children->size == 0 }
);
print $dom;
If you save it as myscript.pl you can call it with ./myscript.pl test.xml.
Would you please try:
perl -0777 -pe 's#<foo>\s*<bar>\s*</bar>\s*</foo>\s*##g' test.xml
The -0777 option tells perl to slurp whole file at once to make the regex match across lines.
Please note that parsing XML with regular expressions is not recommended. Perl has several modules for handling XML, such as XML::LibXML and XML::Twig (XML::Simple also exists, but its own documentation discourages using it in new code). As a standalone program, xmlstarlet is a nice tool for manipulating XML files.

write filename as first line in a txt file + text around it / osx perl

I'm a complete newbie to Perl and I hope you can help me with this line of code.
The issue is related to this one here, but it doesn't quite answer my question. I tried looking for it, but I just get more confused.
I have a batch of .txt files, and I want each file's name printed on its first line, wrapped in specific text. I will convert these files to HTML later, so I would like the name to have "<div class="head">" printed before it and "</div>" after it.
Here is the code I have and it works to print the name:
perl -i -pe 'BEGIN{undef $/;} s/^/$ARGV\n/' `find . -name '*.txt'`
I run this by first navigating to the directory where all the files are.
example of filename: 2016-05-20_18.32.08.txt
The files are plain-text poetry, and in the output I get:
./2016-05-20_18.32.08.txt
in the first line.
I tried something like this:
perl -i -pe 'BEGIN{undef $/;} s/^/$ARGV\n/' `find . -name ‘“<div class="head”>”’*.txt’”</div>”’
but of course it doesn't work; it just gives me a > prompt.
I need to add the extra text in this part, s/^/$ARGV\n/', but I already have trouble defining it.
Can you help, please?
In addition, the filename prints with ./ in the beginning, is there a simple way to exclude that?
perl -i -pe 'BEGIN{undef $/;} s/^/<div class=head> $ARGV <\/div>\n<div class=poem>\n/; s/$/\n<\/div>/' `find . -name '*.txt'`
This should work. But if you are new to perl, I suggest you try working with scripts rather than one-liners.
The -i flag edits the file in place, so if you want an HTML file, remove -i and redirect the output to a separate .html file.
I'm sure there is a more elegant way of doing it, but something like this will work
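#!/usr/bin/perl
use strict;
use warnings;
for my $file (@ARGV) {
    # Slurp the whole file into memory.
    open(my $in, '<', $file) or die "Cannot read $file: $!";
    my $content = do { local $/; <$in> };
    close($in);
    # Rewrite the file with the filename header as its first line.
    open(my $out, '>', $file) or die "Cannot write $file: $!";
    print $out "<div class=\"head\">$file</div>\n$content";
    close($out) or die "Cannot close $file: $!";
}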

Perl regex directly escaping special characters

A perl beginner here. I have been working on some simple one-liners to find and replace text in a file. I read about escaping all special characters with \Q\E or quotemeta() but found this only works when interpolating a variable. For example when I try to replace the part containing special characters directly, it fails. But when I store it in a scalar first it works. Of course, if I escape all the special character in backslashes it also works.
$ echo 'One$~^Three' | perl -pe 's/\Q$~^\E/Two/'
One$~^Three
$ echo 'One$~^Three' | perl -pe '$Sub=q($~^); s/\Q$Sub\E/Two/'
OneTwoThree
$ echo 'One$~^Three' | perl -pe 's/\$\~\^/Two/'
OneTwoThree
Can anyone explain this behavior and also show if any alternative exists that can directly quote special characters without using backslashes?
Interpolation happens first, then \Q, \U, \u, \L and \l.
That means
"abc\Qdef$ghi!jkl\Emno"
is equivalent to
"abc" . quotemeta("def" . $ghi . "!jkl") . "mno"
So,
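s/\Q$~^/Two/ # not ok: $~ (the current format name) is interpolated first, so this is quotemeta($~ . "^")
s/\Q$Sub/Two/ # ok: quotemeta($Sub)
s/\$\~\^/Two/ # ok: every special character is escaped by hand
s/\$\Q~^/Two/ # ok: \$ is a literal dollar sign, and only "~^" goes through quotemeta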

how can i use perl or grep for extracting all links in a file

How can I get all the links from a file that has only one line.
For example file content :
ABC def WWW.----link_1----.html ARE ABC def WWW.---link_2----.html ABC def WWW.---link_3---.html
I have this command so far:
perl -pe 's/.*(WWW.*?.html).*/$1/' file_name
this only gives me:
WWW.---link_1-----.html
the output i want is each link in separate line :
link_1
link_2
link_3
You can use the /g modifier to match every occurrence of a link:
perl -lne 'print for /(WWW.*?[.]html)/g' file_name
output
WWW.----link_1----.html
WWW.---link_2----.html
WWW.---link_3---.html

Match a string, skip if it has a . (DOT) infront of the result

Here's what I use to match a string in a variable and delete the line where the match exists:
sed -i '/'"$domainAndSuffix.cfg"'/d' /etc/file
I'd like to know how to match a string in a variable, but if the match in the file has a . adjacent to it on the immediate left, then it will NOT delete this line and keep going through the file until it finds a match without a .
Sample file Contents:
happy.domain.com
pappy.domain.com
domain.com
String to match:
domain.com
Desired File Output:
happy.domain.com
pappy.domain.com
*Edit:
Actual File Contents:
cfg_file=/etc/nagios/objects/http_url/bob.ca.cfg
cfg_file=/etc/nagios/objects/http_url/therecord.com.cfg
cfg_file=/etc/nagios/objects/http_url/events.therecord.com.cfg
cfg_file=/etc/nagios/objects/http_url/read.therecord.com.cfg
cfg_file=/etc/nagios/objects/http_url/wheels.ca.cfg
cfg_file=/etc/nagios/objects/http_url/used-vehicle-search.autos.ca.msn.com.cfg
cfg_file=/etc/nagios/objects/http_url/womensweekendshow.com.cfg
cfg_file=/etc/nagios/objects/http_url/yorkregion.com.cfg
cfg_file=/etc/nagios/objects/http_url/yourclassifieds.ca.cfg
If the preceding substring is fixed, you can try the following:
PREFIX='cfg_file=\/etc\/nagios\/objects\/http_url\/'
DOMAIN='therecord.com'
sed -i "/^${PREFIX}${DOMAIN}/d" file
If it is not fixed, it would be nice to use a negative lookbehind, but sed can't do that. You can use ssed or GNU grep:
ssed -Ri '/(?<!\.)'"$DOMAIN"'.cfg/d' file
or
grep -vP '(?<!\.)'"$DOMAIN" file > file1 && mv file1 file