How to format XML created from XML::LibXML in Perl

I am creating XML data in Perl using the XML::LibXML module, but when I write the data to a file I want to pretty-print it so that it is easily readable.
Below is a snapshot of how I am creating the XML in my Perl script:
my $xml = XML::LibXML::Document->new('1.0', 'UTF-8');
my $elem = $xml->createElement('A');
$elem->setAttribute('B', $data);
Is there any way to format the XML using XML::LibXML itself? I have to stick with this module.

The method XML::LibXML::Document::serialize writes the XML document as text. Its optional format parameter allows limited control over the output; serialize(1), for instance, requests indented output.
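For instance, a minimal sketch of what the poster's snippet could look like with formatted output (the attribute value is a placeholder; note that libxml2 only indents where it considers the whitespace ignorable):
use XML::LibXML;
my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('A');
$root->setAttribute('B', 'some-data');   # placeholder value
$doc->setDocumentElement($root);
# 0 = no formatting, 1 = indented; serialize() is equivalent to toString()
print $doc->serialize(1);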
XML::LibXML is a thin veneer over the libxml2 system library. That library comes with a hard-coded indentation of two spaces, so unless you write your own pretty-printer, your options are limited.
However, there are a number of standalone utilities that reformat syntactically valid XML, allow more fine-grained control, and can be run as a postprocessor from within Perl on a file containing the serialized XML. I've been satisfied with xmlstarlet and xmllint.
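For example, a sketch of the postprocessing step, assuming xmllint (shipped with libxml2) is installed and using hypothetical file names:
# re-indent the serialized file with xmllint
system('xmllint', '--format', 'raw.xml', '--output', 'pretty.xml') == 0
    or die "xmllint failed: $?";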
Another question is whether you really want to invest many resources in the endeavour. If you need the human-readable version only for debugging or occasional inspection, loading the data into a browser like Chrome or Firefox may be enough: they run XML data through a very decent pretty-printer.

Related

Index PDF files and generate keywords summary

I have a large number of PDF files in my local filesystem that I use as a documentation base, and I would like to create an index of these files.
I would like to:
Parse the contents of the PDF files to get keywords.
Select the most relevant keywords to make a summary.
Create static HTML pages for some keywords, with entries linked to the appropriate files.
My questions are:
Is there an existing tool to perform the whole job?
What is the most appropriate tool to parse the PDF files' content, filter the words (by length), and count them?
I am considering using Perl, swish-e, and pdfgrep to make a script. Do you know of other tools which could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDF, process the tool's output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of text processing you'll need and also provides support for working with all kinds of file formats, via modules.
As for reading the PDF, here are some options if your needs aren't too elaborate:
Use CAM::PDF (and CAM::PDF::PageText) or PDF-API2 modules
Use pdftotext from the poppler library (probably in poppler-utils package)
Use pdftohtml with the -xml option, then read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you use via Perl's builtins like system.
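For instance, a minimal sketch of capturing pdftotext output (assumes poppler's pdftotext is on PATH; the file name is hypothetical):
my $pdf = 'manual.pdf';
# the trailing "-" sends the extracted text to stdout so we can capture it
my $text = qx(pdftotext "$pdf" -);
die "pdftotext failed: $?" if $?;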
The text processing that follows, to build your summary and design the output, is precisely what languages like Perl are for; the couple of tasks mentioned take a few lines of code.
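Continuing the sketch above, a few lines of the kind meant here, filtering words by length and counting them (the thresholds are arbitrary):
my %count;
for my $word (split /\W+/, $text) {
    $count{ lc $word }++ if length($word) >= 5;   # skip short words
}
# keep the 20 most frequent words as candidate keywords
my @keywords = sort { $count{$b} <=> $count{$a} } keys %count;
splice @keywords, 20 if @keywords > 20;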
Then write out the HTML, either directly if it is simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
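With HTML::Template, the final step might look roughly like this (the template file name and loop variables are made up):
use HTML::Template;
my $tmpl = HTML::Template->new(filename => 'index.tmpl');
# the template would contain a <TMPL_LOOP NAME="KEYWORDS"> block
$tmpl->param(KEYWORDS => [ map { { WORD => $_, N => $count{$_} } } @keywords ]);
print $tmpl->output;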
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse pdf so you may still be better off with your own script.

How to convert valgrind output to XML?

I know that there is the Test::Valgrind::Parser::XML Perl module, but I have no idea how to use it. If anyone can provide documentation, that would be great.
The valgrind docs show that valgrind accepts an --xml=yes option to output messages as XML. The format of the XML is specified in docs/internals/xml-output-protocol4.txt inside the source code repository.
With that, you can use any XML parser and do whatever you want with the data.
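For example, you could dump the report to a file (the program name is a placeholder):
valgrind --xml=yes --xml-file=report.xml ./your_program
and then walk it with XML::LibXML:
use XML::LibXML;
my $doc = XML::LibXML->load_xml(location => 'report.xml');
# memcheck reports each problem as an <error> element with a <kind> child
for my $error ($doc->findnodes('//error')) {
    print $error->findvalue('./kind'), "\n";
}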

Perl Extracting XML Tag Attribute Using Split Or Regex

I am working on a file upload system that also parses the uploaded files and generates another file based on information inside them. The files being uploaded are XML files. I only need to parse the first XML tag in each file, and only need the value of the single attribute in that tag.
Sample XML:
<LAB title="lab title goes here">...</LAB>
I am looking for a good way of extracting the value of the title attribute using Perl's split function or a regex. I would use a Perl XML parser if I could install Perl modules on the server hosting my code, but I do not have that ability.
The XML is in a file that I am opening and then attempting to parse to pull out the attribute value. I have tried using both split and regexes with no luck; however, I am not very familiar with Perl or regular expressions.
This is the basic outline of my code so far:
open(LAB, "<", "path-to-file-goes-here") or die "Unable to open lab: $!\n";
foreach my $line (<LAB>) {
    my @pieces = split(/"(.*)"/, $line);
    foreach my $piece (@pieces) {
        print "$piece\n";
    }
}
I have tried using split to match against title alone, using
/title/
or to match against the = character or the " character, using
/\=/ or /\"/
I have also tried doing similar things with a regex and have had no luck either. I am not sure whether I am just not using the proper expression or whether this is not possible with split or a regex. Any help on the matter would be much appreciated, as I am admittedly still a novice at Perl. If this type of question has been answered elsewhere, I apologize; I did some searching and could not find a solution. Most threads suggest using an XML-parsing Perl module, which I would use if I had the privileges to install one.
"But I can't use CPAN" is a quick way to get yourself downvoted on the Perl tag (though it wasn't I who did so). There are many ways that you can use CPAN, even if you don't have root. In fact you can have your own Perl even if you don't have root. While I highly recommend some of those options, for now, the easiest way to do this is just to download some Pure Perl modules, and included them in your codebase. Mojolicious has a very small, but very useful XML/DOM parser called Mojo::DOM which is a likely candidate for this kind of process.

Read excel file without using module

Can I read an Excel file without using any module?
I tried reading one like a normal file and it printed binary characters, maybe because of the encoding?
But reading CSV files works normally.
Excel files are binary files, and the format of the pre-2007 ones is apparently quite hairy. I believe .xlsx files are actually zipped XML, so unzipping them should yield something human-readable, but I've never tried it. Why do you not want to use a module, though?
Some further reading, if you're interested:
http://joelonsoftware.com/items/2008/02/19.html
http://en.wikipedia.org/wiki/Office_Open_XML_file_formats
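To see for yourself, a rough sketch (assumes an unzip binary on PATH; the file names are hypothetical):
# an .xlsx file is a Zip container of XML parts
system('unzip', '-o', 'book.xlsx', '-d', 'book_parts') == 0
    or die "unzip failed: $?";
open my $fh, '<', 'book_parts/xl/worksheets/sheet1.xml' or die $!;
{ local $/; print scalar <$fh>; }   # raw sheet XML; strings may live in xl/sharedStrings.xml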
Can I read an Excel file without using any module?
In theory, yes. In practice, no.
An Excel XLS file is a binary file within a binary file. The first step would be to parse the Excel BIFF data out of the OLE COM document container. This data isn't necessarily in sequential order.
Then you have to parse the Excel BIFF data, allowing for differences between versions, a shared string table with different encodings, and CONTINUE blocks that map large data records in a parser-unfriendly way.
The Excel XLSX format is a little easier since it is a collection of XML files in a Zip container. However, if you aren't using modules then even that would be a pain.
The Perl modules that deal with Excel files represent hundreds of man hours of work. Expect to invest a similar amount of work to avoid them.
And why can't you use modules?
You can try figuring out the format of an Excel spreadsheet, write code for that, and then use it in your program. Maybe write it as a module and submit it to CPAN. Wait a second! There's already a module like that there!
The whole purpose of CPAN is to prevent you from having to reinvent the wheel. You need to read an Excel spreadsheet, and someone has done the hard work of figuring out how to do this, and is giving it to you free of charge. A $40,000 value [1], and it's yours for free! The CPAN system makes installing modules fairly simple: you run the cpan command. There's no real reason to avoid modules that can save you hundreds of hours of work.
And what type of modules do you avoid? Is it all modules, or only modules that are not included in the standard distribution? I hate to think you avoid things like File::Copy or Data::Dumper just because they're modules, even though they're included by default in most Perl distributions.
[1] Imagine hiring a team to write code to convert an Excel file so it can be read by a Perl program. They'd have to figure out the ins and outs of the file format, code for all sorts of edge cases, and run it through all sorts of tests to make sure it really works. A rough estimate, if we don't include things like charts, embedded content, and remote data access, would be about 200 man-hours, but only because the format actually has been documented.

DocBook to Word Conversion?

I need some help with conversion of DocBook files to Microsoft Word files.
Do I need an XSL file for the transformation?
Yes, you do need an XSL file. You can get XSL files for DocBook from the free DocBook XML distribution. Then, you run a free XSLT transformer such as Saxon. If you run Saxon from a command line, you give it the name of your DocBook file, and the name of one of the stylesheets, and it will transform your file according to the rules in the stylesheet.
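A Saxon command line might look like this (the jar and file names vary by version and are placeholders here):
java -jar saxon-he.jar -s:book.xml -xsl:docbook-xsl/fo/docbook.xsl -o:book.fo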
What you need to do to transform to Word is pick the right stylesheet.
From DocBook XSL: The Complete Guide, here are three possibilities:
Convert to XSL-FO and then use XMLmind to export to Word. See the XMLmind website for more information.
Use a limited set of tags and then use one of DocBook XML's included stylesheets to output to WordML.
Try to use Jfor to output to RTF, although Jfor no longer appears to be maintained.
And I have one of my own:
As above, use one of DocBook XML's included stylesheets to publish to XSL-FO, then run Apache FOP to convert from XSL-FO to RTF. You will lose the structural information, but you will keep a certain amount of the formatting.
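The FOP step of that last route could look like this (the file names are placeholders):
fop -fo book.fo -rtf book.rtf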
I recently implemented the same feature for our users. They use the Oxygen XML editor, which allows easy transformations via XSL. I was going to target OOXML but settled on WordML. As a starting point I used the roundtrip XSL, but I had to rewrite lots of templates because of existing bugs or simply missing functionality. In addition, I made other customizations to serve our purpose, or for our XML files only.
I would not mind contributing back to the project, but I don't really know how to go about it.
I know this is an 11-year-old question. But now, in 2022, you can use pandoc to convert DocBook to MS Word (docx):
pandoc --from docbook --to docx --output filename.docx filename.docbook
I am using XQuery to transform DocBook into various formats with an XQuery typeswitch library. XQuery uses indexes, so I can transform many documents very quickly.