Converting HTML to text with Perl

I have a bunch of HTML files and need to convert and format them to text with Perl, i.e. something like <br/> should be interpreted as \n.
I found the Perl module HTML::FormatText on CPAN. It formats the text well, but it strips any links.
Is there an option in HTML::FormatText to convert the HTML to text as is, but keep the URL when there are links like this?
<a href="http://www.microsoft.com">http://www.microsoft.com</a>
I.e. something like this:
<br /><b>Microsoft</b><br /><a href="http://www.microsoft.com">
will be converted to:
microsoft
http://www.microsoft.com

Take a look at HTML::FormatText::WithLinks
Setting the after_link option to, say, " (%l)" will put the link in line after the anchor text. In your example you would get Microsoft (http://www.microsoft.com).
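To illustrate the idea behind this answer (keep the anchor text, then append the URL, and turn <br> into a newline), here is a minimal sketch in Python rather than Perl. It does not use HTML::FormatText::WithLinks or reproduce its behaviour; the class name, the function name and the `after_link` template are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextWithLinks(HTMLParser):
    """Collect text, turn <br> into newlines, and append each link's URL."""

    def __init__(self, after_link=" ({url})"):
        super().__init__(convert_charrefs=True)
        # Hypothetical template, loosely modelled on the after_link idea above.
        self.after_link = after_link
        self.parts = []
        self.href_stack = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "a":
            # Remember the href so it can be emitted when the anchor closes.
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(self.after_link.format(url=href))

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextWithLinks()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Called on a fragment such as `<b>Microsoft</b><br /><a href="http://www.microsoft.com">Home</a>`, the sketch returns the bold text on one line and `Home (http://www.microsoft.com)` on the next.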

Superscript within code block in Github Markdown

The <sup></sup> tag is used for superscripts. Creating a code block is done with backticks. The issue I have is when I try to create a superscript within a code block, it prints out the <sup></sup> tag instead of formatting the text between the tag.
How do I have superscript text formatted correctly when it's between backticks?
Post solution edit
Desired output:
A² instead of A<sup>2</sup>
This is not possible unless you use raw HTML.
The rules specifically state:
With a code span, ampersands and angle brackets are encoded as HTML entities automatically, which makes it easy to include example HTML tags.
In other words, it is not possible to use HTML to format text in a code span. In fact, a code span is plain, unformatted text. Having any of that text appear as a superscript would mean it is not plain, unformatted text. Thus, this is not possible by design.
However, the rules also state:
Markdown is not a replacement for HTML, or even close to it. Its
syntax is very small, corresponding only to a very small subset of
HTML tags. The idea is not to create a syntax that makes it easier
to insert HTML tags. In my opinion, HTML tags are already easy to
insert. The idea for Markdown is to make it easy to read, write, and
edit prose. HTML is a publishing format; Markdown is a writing
format. Thus, Markdown's formatting syntax only addresses issues that
can be conveyed in plain text.
For any markup that is not covered by Markdown's syntax, you simply
use HTML itself. ...
So, if you really need some text in a code span to be in superscript, then use raw HTML for the entire span (be sure to escape things manually as required):
<code>A code span with <sup>superscript</sup> text and escaped characters: "&lt;&amp;&gt;".</code>
Which renders as:
A code span with superscript text and escaped characters: "<&>".
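If the raw HTML span is generated by a script, the manual escaping can be delegated to a library. The following is a small Python sketch of that step; the helper name `code_with_sup` is made up for this example, and it assumes the Markdown renderer passes raw <code> and <sup> tags through unchanged.

```python
from html import escape


def code_with_sup(before, sup, after):
    """Build a raw-HTML code span whose middle part is superscripted.

    Every text piece is escaped so that <, > and & show up literally.
    """
    return "<code>{}<sup>{}</sup>{}</code>".format(
        escape(before, quote=False),
        escape(sup, quote=False),
        escape(after, quote=False),
    )


print(code_with_sup("A", "2", ' and escaped characters: "<&>".'))
```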
This is expected behaviour:
Markdown wraps a code block in both <pre> and <code> tags.
You can use Unicode superscript and subscript characters within code blocks:
class SomeClass¹ {
}
Inputting these characters will depend on your operating system and configuration. I like to use compose key sequences on my Linux machines. As a last resort you should be able to copy and paste them from something like the Wikipedia page mentioned above.
¹Some interesting footnote, e.g. referencing MDN on <pre> and <code> tags.
If you're lucky, the characters you want to superscript (or subscript) may have dedicated codepoints in Unicode. These will work inside code blocks, as demonstrated in your question, where you include A² in backticks. E.g.:
Water (chemical formula H₂O) is transparent, tasteless and odourless.
I've listed out the super and subscript Unicode characters in this Gist. You should be able to copy and paste any you need from there.
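For digits and a few operators, the mapping to the dedicated Unicode codepoints can also be scripted instead of copied by hand. This is a minimal Python sketch; the table and function names are chosen for this example, and it only covers the characters listed in the two tables.

```python
# Plain characters and their Unicode superscript / subscript counterparts.
PLAIN = "0123456789+-=()"
SUPERSCRIPTS = str.maketrans(PLAIN, "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾")
SUBSCRIPTS = str.maketrans(PLAIN, "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎")


def to_superscript(text):
    """Replace every mappable character with its superscript form."""
    return text.translate(SUPERSCRIPTS)


def to_subscript(text):
    """Replace every mappable character with its subscript form."""
    return text.translate(SUBSCRIPTS)


print("A" + to_superscript("2"))       # A²
print("H" + to_subscript("2") + "O")   # H₂O
```

Characters without an entry in the tables (most letters, for instance) are left unchanged.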

How to parse .pdf files in Perl?

How to parse .pdf files in Perl?
Is Perl more efficient, or should I use another language?
When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file which I parse using XML::Twig (or any other XML parser you like except XML::Simple).
The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page with down and right being positive. Dimensions are in PostScript points (72 points per inch).
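The ordering step described above can be sketched as follows. The example is in Python with the standard-library XML parser rather than XML::Twig, and the embedded sample document is a hand-written approximation of pdftohtml -xml output, not real tool output; attribute values and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the structure described above.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="200" left="100" width="300" height="15" font="0">Second line</text>
    <text top="100" left="100" width="300" height="15" font="0">First <b>bold</b> line</text>
    <text top="200" left="450" width="100" height="15" font="0">(right column)</text>
  </page>
</pdf2xml>"""


def page_lines(xml_text):
    """Return one list of text lines per page, sorted top-to-bottom, left-to-right."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        texts = sorted(
            page.iter("text"),
            key=lambda t: (int(t.get("top")), int(t.get("left"))),
        )
        # itertext() also picks up text nested inside <b> and <i> children.
        pages.append(["".join(t.itertext()) for t in texts])
    return pages


for lines in page_lines(SAMPLE_XML):
    print("\n".join(lines))
```

Sorting by exact `top` values is a simplification: in real files, fragments on the same visual line may differ by a point or two, so some tolerance is usually needed.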
I personally use CAM::PDF.
my $doc = CAM::PDF->new($fileName) || die "$CAM::PDF::errStr\n";
CAM::PDF->asciify(\$pdfString);
PDFs are designed for display and printing, not for parsing, so extraction is always trial and error, and it may well be impossible if everything on the page is graphics. A good indicator is whether you can copy and paste the content from the PDF into an editor. If that works, you are in business.
Look on CPAN; specifically, if you want to do OCR, see PDF::OCR2.
I don't know of any module that parses PDFs, that is, if you want to extract the text from them. There are a number of modules that let you manipulate them, though. Try PDF::API2.

regex_replace to replace certain html tags

Is there a way to convert BR tags and/or DIV tags to new lines so it will format correctly when I use an in a mailto? I was thinking I should look for any P, DIV, and BR tags and replace them with a new line character. So anywhere there is a closing tag put the new line character and remove the opening tag. After I do the above I will remove the rest of the html with remove_html="1" but I want to keep the paragraph format.
I thought it can be done using regex_replace but I'm not sure how to write it. Anyone know?
Do not parse HTML with regular expressions. Use an HTML parser module (HTML::TreeBuilder, or something similar that can make in-place changes) or, even better in this case, XSLT transformations.

Only display one paragraph of text

You can set what the Facebook Share preview says. I would like it to be the first paragraph of my Movable Type entry. The people who make entries sometimes use <p> tags, or they use the rich editor, which puts in two <br /><br /> tags to separate paragraphs.
Is there a way I can have Movable Type detect where the first paragraph ends and only display the first paragraph? I would like to add that to my entry template so it will add some information to my head.
EntryBody has a lot of attributes to help format the output of the tag. You can use those to change the content so it shows up correctly in HTML, JavaScript, PHP, XML or other forms of output.
If you understand how to use regular expressions, you can use that and an additional language, say PHP, to break the body up into an array and only output the first paragraph or element of the array.
The simplest thing, though, I would think, would be to do something like
<mt:EntryBody words="100">
That will cut off the entry body after the first 100 words. You could also require users to upload an excerpt with the entry and use the entry excerpt for Facebook, instead.
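The regular-expression approach mentioned above (split the body into an array and output only the first element) could look roughly like this. The answer suggests PHP; this sketch uses Python instead, the function name is hypothetical, and it assumes paragraphs end either at a closing </p> tag or at two or more consecutive <br> tags.

```python
import re

# A paragraph ends at </p> or at a run of two or more <br> tags.
PARAGRAPH_BREAK = re.compile(r"</p\s*>|(?:<br\s*/?>\s*){2,}", re.IGNORECASE)
# Any remaining tag, removed so the preview is plain text.
TAG = re.compile(r"<[^>]+>")


def first_paragraph(body):
    """Return the text of the first non-empty paragraph in an entry body."""
    for chunk in PARAGRAPH_BREAK.split(body):
        text = TAG.sub("", chunk).strip()
        if text:
            return text
    return ""


print(first_paragraph("<p>First one.</p><p>Second one.</p>"))
print(first_paragraph("First one.<br /><br />Second one."))
```

This is only adequate for simple entry markup; nested or malformed HTML would need a real parser.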

Decode HTML from XML with NewLine

First I parse XML and retrieve this:
<p><strong>Berns Salonger - the City's
Then I decode it with MWFeedParser (stringByDecodingHTMLEntities) and retrieve this:
<p><strong>Berns Salonger - the City's Ideal Meeting Place
Note that this is only one line of many, many lines, which include a lot of tags.
Then I replace <br> with \n and the console writes out the text with new lines. Everything is great except that all the other HTML tags are still there.
So I then run stringByConvertingHTMLToPlainText and all the HTML tags disappear. But so do my replaced new lines.
How can I decode the HTML and strip the tags while still replacing <br> with \n, so that nicely formatted text is printed in a UITextView?
Instead of replacing <br> with \n, try replacing it with the HTML entity for a newline: &#10;. Then, when you call stringByConvertingHTMLToPlainText, it will convert the entity to an actual newline character.
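The ordering trick in this answer is independent of MWFeedParser and can be shown in a few lines. The sketch below is in Python, not Objective-C, and does not call stringByConvertingHTMLToPlainText; the function and class names are invented for this example. It swaps each <br> for the numeric character reference of a line feed before stripping tags, so the line break survives the stripping and is decoded afterwards.

```python
import re
from html.parser import HTMLParser

BR = re.compile(r"<br\s*/?>", re.IGNORECASE)


class _TextOnly(HTMLParser):
    """Keep character data only; character references are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_plain_text(html):
    parser = _TextOnly()
    # Replace <br> with the line-feed reference first, then strip tags.
    parser.feed(BR.sub("&#10;", html))
    parser.close()
    return "".join(parser.chunks)


print(html_to_plain_text("<p><strong>Berns Salonger</strong><br>the City&#39;s Ideal Meeting Place</p>"))
```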