I have a folder that contains n HTML files. I read each file and extract one line from it, i.e. I collect the <img /> tags into an array and print the array. But nothing gets printed. Can you help me? My code is here.
use strict;
use File::Basename;
use File::Path;
use File::Copy;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Excel';
print "Welcome to PERL program\n";
#print "\n\tProcessing...\n";
my $foldername = $ARGV[0];
opendir(DIR,$foldername) or die("Cannot open the input folder for reading\n");
my (@htmlfiles) = grep /\.html?$/i, readdir(DIR);
closedir(DIR);
@htmlfiles = grep !/(?:index|chapdesc|listdesc|listreview|addform|addform_all|pattern)\.html?$/i, @htmlfiles;
# print "HTML file is @htmlfiles";
my %fileimages;
my $search_for = 'img';
my $htmlstr;
for my $files (@htmlfiles)
{
    if (-e "$foldername\\$files")
    {
        open(HTML, "$foldername\\$files") or die("Cannot open the html file '$files' for reading");
        my $htmlstr = do { local $/; <HTML> };
        close(HTML);
        $fileimages{uc($2)} = [$1, $files] while ($htmlstr =~ /<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi);
    }
}
At the command prompt:
perl findtext.pl "C:\viji\htmlfiles"
I would like to point out that parsing HTML with regexes is futile. See the epic https://stackoverflow.com/a/1732454/1521179 for the answer.
Your regex to extract image tags is quite broken. Instead of using an HTML parser and walking the tree, you search for a string that…
/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi
begins with <img
after exactly one space, the sequence id=" is found. The contents of that attribute are captured if it is found, else the match fails. The closing " is consumed.
after exactly one space, the sequence src="./images/ is found,
followed by a character that is not t. (This allows for ", of course).
This is followed by any number of any characters that are not slashes or <> characters (This allows for ", again),
followed by a slash.
now capture this:
one or more characters that are not dots
followed by the suffix .jpg
after which " has to follow immediately.
false positives
Here is some data that your regex will match, where it shouldn't:
<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"
So what is the image path you get? ./ImAgEs/s" alt="foo/bar.jpg may not be what you wanted.
<!-- <iMg id="" src="./images/./foobar.jpg" -->
Oops, I matched commented content. And the path does not contain a real subfolder of ./images: the . folder is perfectly valid in your regex, but denotes the same folder. I could even use .., which would be the folder of your HTML file. Or I could use ./images/./t-rex/image.jpg, which would match a forbidden t-folder.
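This false positive is easy to reproduce. The following self-contained check (pure core Perl) runs the question's regex against the commented-out fragment and still matches:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The commented-out <img> from above: a regex has no idea about comments.
my $html = '<!-- <iMg id="x" src="./images/./foobar.jpg" -->';

my ($id, $file);
if ($html =~ /<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/i) {
    ($id, $file) = ($1, $2);
    print "matched: id='$id', file='$file'\n";
}
```

The /i flag lets `<iMg` match, the comment markers are invisible to the pattern, and the `.` "subfolder" satisfies the `[^t]` class.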
false negatives
Here is some data you would want, but that you won't get:
<img
id="you-cant-catch-me"
src='./images/x/awesome.jpg' />
Why? Newlines: you only allow for single spaces between the attributes. Also, you don't allow for single quotes (').
<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />
Why? Now there are single spaces, but the attributes are swapped. Yet both these fragments denote the exact same DOM and therefore should be considered equivalent.
Conclusion
Go to http://www.cpan.org/ and search for HTML and Tree. Use a module to parse your HTML, walk the tree, and extract all matching nodes.
Also, add a print statement somewhere. I found a
use Data::Dumper;
print Dumper \%fileimages;
quite enlightening for debug purposes.
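For reference, here is a minimal, self-contained sketch of that debugging step (core modules only; the HTML string is inlined instead of read from disk, and the file name sample.html is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Build %fileimages exactly as in the question, but from an inlined
# sample string, then dump the hash to see what was captured.
my %fileimages;
my $htmlstr = '<img id="fig1" src="./images/ch01/photo.jpg">';

$fileimages{uc($2)} = [$1, 'sample.html']
    while $htmlstr =~ /<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi;

print Dumper \%fileimages;
```

Dumping the hash immediately shows whether the regex captured anything at all, which is the first thing to check when "nothing gets printed".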
The following working code reads my XML file containing lots of empty elements, then applies 2 changes and saves it again under different name.
But it also changes empty elements like <element></element> to self-closing tags like <element /> which is unwanted.
How to save it not using self-closing tags?
Or, in other words, how do I tell XML::LibXML to use empty tags?
The original file is produced in commercial application, which uses style with empty elements, so I want to sustain that.
#! /usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $filename = 'out.xml';
my $dom = XML::LibXML->load_xml(location => $filename);
my $query = '//scalar[contains(@name, "partitionsNo")]/value';
for my $i ($dom->findnodes($query)) {
    $i->removeChildNodes();
    $i->appendText('16');
}
open my $out, '>', 'out2.xml' or die "Cannot open out2.xml: $!";
binmode $out;
$dom->toFH($out);
# now out2.xml has only self-closing tags where previously
# were used empty elements
Unfortunately, XML::LibXML doesn't support libxml2's xmlsave module which has a flag to save without empty tags.
As a workaround you can add an empty text node to empty elements:
for my $node ($doc->findnodes('//*[not(node())]')) {
    # Note that appendText doesn't work.
    $node->appendChild($doc->createTextNode(''));
}
This is a bit costly for large documents, but I'm not aware of a better solution.
That said, the fragments <foo></foo> and <foo/> are both well-formed and semantically equivalent. Any XML parser or application that treats such fragments differently is buggy.
Note that some people believe the XML spec recommends using self-closing tags, but that's not exactly true. The XML spec says:
Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag should be used, and should only be used, for elements which are declared EMPTY.
This means elements that are declared EMPTY in a DTD. For other elements, or if no DTD is present, the XML standard advises not to use self-closing tags ("and should only be used"). But this is only a non-binding recommendation for interoperability.
There is a package variable
$XML::LibXML::setTagCompression
Setting it to a true value forces all empty tags to be printed as <e></e>, while a false value forces <e/>.
See SERIALIZATION in the Parser documentation.
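A short sketch of the effect (assuming XML::LibXML is installed, as elsewhere in this thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => '<root><e/></root>');

# True value: serialize empty elements as <e></e>
$XML::LibXML::setTagCompression = 1;
my $expanded = $doc->toString;

# False value (the default): serialize them as <e/>
$XML::LibXML::setTagCompression = 0;
my $compressed = $doc->toString;

print $expanded, $compressed;
```

Note that the variable affects serialization globally, so set it back when you are done if other code in the process also writes XML.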
I'm trying to write an Automator service, so I can chuck this into a right-click menu in the gui.
I have a filepath to a txt file, and there is a similarly named file that varies only in the file extension. This can be a pdf or a jpg, or potentially any other extension, no way to know beforehand. How can I get the filepath to this other file (there will only be one such)?
$other_name =~ s/txt$/!(txt)/;
$other_name =~ s/ /?/g;
my @test = glob "$other_name";
In Bash, I'd just turn on the extglob option, and change the "txt" at the end to "!(txt)" and the do glob expansion. But I'm not even sure if that's available in perl. And since the filepaths always have spaces (it's in one of the near-root directory names), that further complicates things. I've read through the glob() documentation at http://perldoc.perl.org/functions/glob.html and tried every variation of quoting (the above example code shows my attempt after having given up, where I just remove all the spaces entirely).
It seems like I'm able to put modules inside the script, so this doesn't have to be bare perl (just ran a test).
Is there an elegant or at least simple way to accomplish this?
You can extract everything in the filename up to extension, then run a glob with that and filter out the unneeded .txt. This is one of those cases where you need to protect the pattern in the glob with a double set of quotes, for spaces.
use warnings;
use strict;
use feature qw(say);
my $file = "dir space/file with spaces.txt";
# Pull the full name without extension
my ($basefname) = $file =~ m/(.*)\.txt$/;
# Get all files with that name and filter out unneeded (txt)
my @other_exts = grep { not /\.txt$/ } glob(qq{"$basefname.*"});
say for @other_exts;
With a toy structure like this
dir space/
    file with spaces.pdf
    file with spaces.txt
The output is
dir space/file with spaces.pdf
This recent post has more on related globs.
Perl doesn't allow the not substring construct in glob. You have to find all files with the same name and any extension, and remove the one ending with .txt
This program shows the idea. It splits the original file name into a stem part and a suffix part, and uses the stem to form a glob pattern. The grep removes any result that ends with the original suffix
It picks only the first matching file name if there is more than one candidate. $other_name will be set to undef if no matching file was found
The original file name is expected as a parameter on the command line
The result is printed to STDOUT; I don't know what you need for your right-click menu
The line use File::Glob ':bsd_glob' is necessary if you are working with file paths that contain spaces, as it seems you are
use strict;
use warnings 'all';
use File::Glob ':bsd_glob';
my ($stem, $suffix) = shift =~ /(.*)(\..*)/;
my ($other_name) = grep ! /$suffix$/i, glob "$stem.*";
$other_name =~ tr/ /?/;
print $other_name, "\n";
This is an example, based on File::Basename core module
use File::Basename;
my $fullname = "/path/to/my/filename.txt";
my ($name, $path, $suffix) = fileparse($fullname, qw/.txt/);
my $new_filename = $path . $name . ".pdf";
# $name --> filename
# $path --> /path/to/my/
# $suffix --> .txt
# $new_filename --> /path/to/my/filename.pdf
For the life of me I can't figure out the proper code to access the comment lines in my XML file. Do I use findnodes, find, getElementByTagName (doubt it).
Am I even making the correct assumption that these comment lines are accessible? I would hope so, as I know I can add a comment.
The type number for a comment node is 8, so they must be parseable.
Ultimately, what I want to do is delete them.
my @nodes = $dom->findnodes("//*");
foreach my $node (@nodes) {
    print $node->nodeType, "\n";
}
<TT>
<A>xyz</A>
<!-- my comment -->
</TT>
If all you need to do is produce a copy of the XML with comment nodes removed, then the first parameter of toStringC14N is a flag that says whether you want comments in the output. Omitting all parameters implicitly sets the first to a false value, so
$doc->toStringC14N
will reproduce the XML trimmed of comments. Note that the Canonical XML form specified by C14N doesn't include an XML declaration header. It is always XML 1.0 encoded in UTF-8.
If you need to remove the comments from the in-memory structure of the document before processing it further, then findnodes with the XPath expression //comment() will locate them for you, and unbindNode will remove them from the XML.
This program demonstrates
use strict;
use warnings;
use XML::LibXML;
my $doc = XML::LibXML->load_xml(string => <<END_XML);
<TT>
<A>xyz</A>
<!-- my comment -->
</TT>
END_XML
# Print everything
print $doc->toString, "\n";
# Print without comments
print $doc->toStringC14N, "\n\n";
# Remove comments and print everything
$_->unbindNode for $doc->findnodes('//comment()');
print $doc->toString;
output
<?xml version="1.0"?>
<TT>
<A>xyz</A>
<!-- my comment -->
</TT>
<TT>
<A>xyz</A>
</TT>
<?xml version="1.0"?>
<TT>
<A>xyz</A>
</TT>
Update
To select a specific comment, you can add a predicate expression to the XPath selector. To find the specific comment in your example data you could write
$doc->findnodes('//comment()[. = " my comment "]')
Note that the text of the comment includes everything except the leading and trailing --, so spaces are significant as shown in that call.
If you want to make things a bit more lax, you could use normalize-space, which removes leading and trailing whitespace and contracts every internal sequence of whitespace to a single space. Now you can write
$doc->findnodes('//comment()[normalize-space(.) = "my comment"]')
And the same call would find your comment even if it looked like this.
<!--
my
comment
-->
Finally, you can make use of contains, which, as you would expect, simply checks whether one string contains another. Using that you could write
$doc->findnodes('//comment()[contains(., "comm")]')
The one to choose depends on your requirement and your situation.
According to the XPath spec:
* is a test that matches element nodes of any name. Comment nodes aren't element nodes.
comment() is a test that matches comment nodes.
Untested:
for my $comment_node ($doc->findnodes('//comment()')) {
    $comment_node->parentNode->removeChild($comment_node);
}
I know it's not XML::LibXML but here you have another way to remove comments easily with XML::Twig module:
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my $twig = XML::Twig->new(
    pretty_print => 'indented',
    comments     => 'drop',
)->parsefile( shift )->print;
Run it like:
perl script.pl xmlfile
That yields:
<TT>
<A>xyz</A>
</TT>
The comments option also accepts the value process, which lets you work with comments as elements whose tag is #COMMENT.
I am using Perl to process some text files. I want to use a filename wildcard to find all the useful files in a folder and handle them one by one, but there are spaces in the filenames, and the wildcard does not match those filenames properly. Here is my code:
my $term = "Epley maneuver";
my @files = <rawdata/*$term*.csv>;
my %infiles;
foreach my $infilename (@files) {
    if ($infilename =~ m/(\d+)_.*\.csv/)
    {
        $infiles{$infilename} = $1;
        print $1."\n";
    }
}
The filename are like:
34_Epley maneuver_2012_4_6.csv
33_Epley maneuver_2012_1_3.csv
32_Epley maneuver_2011_10_12.csv
...
They are in a folder named "rawdata".
When I used this for terms that don't contain spaces, like "dizzy", it worked well. But when the term contains a space, it just stops working. I searched on Google but found little useful information.
What happens and how can I do this correctly?
Any help will be good. Thanks a lot.
The glob operator works like the command-line processor. If you write <rawdata/*Epley maneuver*.csv> it will look for files that match rawdata/*Epley or maneuver*.csv
You must put your glob expression in double-quotes:
my #files = <"rawdata/*$term*.csv">
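The difference is easy to demonstrate with a throwaway directory (core modules only; the file names mimic those in the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create two files whose names contain a space.
my $dir = tempdir(CLEANUP => 1);
for my $name ('34_Epley maneuver.csv', '33_Epley maneuver.csv') {
    open my $fh, '>', "$dir/$name" or die "open: $!";
    close $fh;
}

my $term = 'Epley maneuver';

# Unquoted: the space splits this into two patterns, matching nothing.
my @unquoted = glob "$dir/*$term*.csv";

# Double quotes inside the glob string: one pattern, matches both files.
my @quoted = glob qq{"$dir/*$term*.csv"};

printf "unquoted: %d, quoted: %d\n", scalar @unquoted, scalar @quoted;
```

The unquoted form finds zero files because neither half-pattern matches anything, while the quoted form finds both.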
I have two inputs reading into my command prompt, the first being a series of words that are to be searched by the program I'm writing, and the second being the file that contains where the words are to be found. So, for instance, my command prompt reads perl WebScan.pl word WebPage000.htm
Now, I have no trouble accessing either of these inputs for printing, but I am having great difficulty accessing the contents of the web page so I can apply regular expressions to remove HTML tags and get at the content. I realize there is a module available that does this far more effectively without regular expressions, but I need to do it with regular expressions :(.
I can access the html file for printing with no trouble:
open (DATA, $ARGV[1]);
my @file = <DATA>;
print @file;
Which prints the entire code of the html page, but I am unable to pass regular expressions in order to remove html blocks. I keep receiving an error that says "Can't modify array dereference in s/// near," which is where I have my specific regular expression. I'm not sure how to get around this- I've tried converting the array into a scalar but then I am unable to access any of the data in the html at all (and no, it doesn't just print the number of values in the array :P)
How do I access the array's contents so I can use regular expressions to refine the desired output?
It sounds like you are doing something like @file =~ s/find/replace/;. You are getting that error because the left-hand side of the regex binding operator imposes scalar context on its argument. An array in scalar context returns its length, but this value is read-only. So when your substitution tries to perform the replacement, kaboom.
In order to process all of the lines of the file, you could use a foreach loop:
foreach my $line (@file) { $line =~ s/find/replace/ }
or more succinctly as:
s/find/replace/ for @file;
However, if you are running regular expressions on an HTML file, chances are you will need them to match across multiple lines. What you are doing above is reading the entire file in and storing each line as an element of @file. If you use one of Perl's iterative control structures on the array, you will not be able to match multiple lines. So you should instead read the file into a single scalar. You can then use $file =~ s/// as expected.
You can slurp the file into a single variable by temporarily clearing the input record separator $/:
my $file = do {local $/; <DATA>};
In general, regular expressions are the wrong tool for parsing HTML, but it sounds like this is a homework assignment, so in that case it's just practice anyway.
And finally, in modern Perl, you should use the three argument form of open with a lexical file handle and error checking:
open my $DATA, '<', $ARGV[1] or die "open error: $!";
my $file = do {local $/; <$DATA>};
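With the whole file in one scalar, a substitution can then span line boundaries. A small sketch, with an inlined string standing in for the slurped file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a slurped file: the <b>...</b> pair spans two lines.
my $file = "before <b>bold\ntext</b> after\n";

# /s makes . match the embedded newline, so the whole tag pair is replaced.
$file =~ s/<b>.*?<\/b>/REPLACED/s;

print $file;
```

The same substitution run line by line over an array would never match, because no single line contains both the opening and closing tag.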