I have a PDF that consists only of text, with no special characters, images, etc.
Is there any Perl module out there (I've been looking on CPAN to no avail) to help me parse each page line by line?
(Converting the PDF to text yields bad results and unparsable data)
Thanks,
When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file which I parse using XML::Twig (or any other XML parser you like except XML::Simple).
The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page with down and right being positive. Dimensions are in PostScript points (72 points per inch).
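The sorting step described above can be sketched as follows. This is a minimal illustration in Python rather than Perl with XML::Twig, and the sample XML is a hypothetical, hand-written imitation of `pdftohtml -xml` output (the attribute values are invented); real files may carry more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `pdftohtml -xml` output (values invented).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="200" left="100" width="80" height="14" font="0">second line</text>
<text top="100" left="300" width="60" height="14" font="0">world</text>
<text top="100" left="100" width="60" height="14" font="0"><b>Hello</b>,</text>
</page>
</pdf2xml>"""


def page_lines(xml_string):
    """Return {page number: [text, ...]} sorted top-to-bottom, then left-to-right."""
    root = ET.fromstring(xml_string)
    pages = {}
    for page in root.iter("page"):
        items = []
        for text in page.iter("text"):
            top = int(text.get("top"))
            left = int(text.get("left"))
            # itertext() flattens nested <b>/<i> children into plain text.
            items.append((top, left, "".join(text.itertext())))
        # Sort by vertical position first, then horizontal position.
        items.sort(key=lambda item: (item[0], item[1]))
        pages[int(page.get("number"))] = [content for _, _, content in items]
    return pages


print(page_lines(SAMPLE_XML))  # {1: ['Hello,', 'world', 'second line']}
```

Fragments that share the same `top` value belong to the same visual line, so a further step could join them with a space if whole lines are needed.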
I am working on a piece of code (Scala) which renders documents such as PDF, Word, etc. using XSL-FO.
At one point in the implementation, a new line is introduced using \n. During debugging, I see that the resulting XML string is:
<fo:inline xmlns:fo="http://www.w3.org/1999/XSL/Format">
</fo:inline>
But this is respected only for PDFs; for Word documents no new line is introduced.
How can I get it working for Word too?
White-space characters generally collapse and linefeeds are generally ignored.
To preserve linefeeds, add either white-space="pre" (see https://www.w3.org/TR/xsl11/#white-space) or linefeed-treatment="preserve" (see https://www.w3.org/TR/xsl11/#linefeed-treatment).
white-space is a shorthand that sets values for linefeed-treatment, white-space-collapse, white-space-treatment, and wrap-option.
Your other option is to generate <fo:block />, as @kevin-brown suggested (and as I only just saw).
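The block-per-line option could look roughly like this. It is a sketch in Python for illustration only (the question's code is Scala, but the idea carries over directly), and `lines_to_fo_blocks` is a hypothetical helper name, not part of any XSL-FO library.

```python
from xml.sax.saxutils import escape

FO_NS = "http://www.w3.org/1999/XSL/Format"


def lines_to_fo_blocks(text):
    """Wrap each line of `text` in its own <fo:block>, so no literal linefeed is needed."""
    blocks = []
    for line in text.split("\n"):
        # escape() protects &, < and > so the result stays well-formed XML.
        blocks.append('<fo:block xmlns:fo="%s">%s</fo:block>' % (FO_NS, escape(line)))
    return "".join(blocks)


print(lines_to_fo_blocks("first line\nsecond line"))
```

Because each line becomes its own block, the output no longer depends on how a given renderer treats linefeed characters.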
In a mapping editor, the display is correct after the legacy-to-Unicode conversion for Devanagari text shown in a Unicode font (Arial Unicode MS). However, in MS Word, the same Unicode text isn't displayed as expected in that font (Arial Unicode MS) or in any other Devanagari Unicode font. The expected sequence of Unicode code points is provided as per the documentation. The sequence can be seen in the table on the left-hand side.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the zero-width joiner (ZWJ)? The halant (virama) by itself is enough to get the half-consonant (for some combinations), and in particular, it may be that Word is using the presence of the ZWJ to keep them separate.
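To test that idea quickly, the ZWJ can be stripped from the converted text before pasting it into Word. The following is a minimal Python sketch; the sample sequence (KA, virama, ZWJ, SSA) is an invented example, not the sequence from the question.

```python
ZWJ = "\u200d"     # ZERO WIDTH JOINER
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)


def strip_zwj(text):
    """Remove every zero-width joiner, leaving the virama sequences intact."""
    return text.replace(ZWJ, "")


# Hypothetical sample: KA + virama + ZWJ + SSA.
sample = "\u0915" + VIRAMA + ZWJ + "\u0937"
cleaned = strip_zwj(sample)
print([hex(ord(ch)) for ch in cleaned])  # ['0x915', '0x94d', '0x937']
```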
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[Aside: the way to tell whether it's being treated as a single run is to save the document as an XML file, open it with something like Notepad++, and look at the XML "w:t" elements (IIRC) associated with these characters. If they're all in separate w:t elements, they're in separate runs. In that case, you might need to copy the text from Word to some other tool (e.g. Notepad++), then copy it from there and paste it back into Word -- that might cause it to be imported into Word as a single run.]
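The same check can be scripted instead of reading the XML by eye. This Python sketch assumes the WordprocessingML namespace used by .docx files; a document saved in the older Word 2003 XML format uses a different namespace, so `W_NS` would need adjusting. The sample document is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by .docx WordprocessingML (assumed; older formats differ).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: the same cluster split across two runs.
SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body><w:p>
<w:r><w:t>\u0915</w:t></w:r>
<w:r><w:t>\u094d\u0937</w:t></w:r>
</w:p></w:body></w:document>"""


def run_texts(xml_string):
    """Return the text of every w:t element, one list entry per element."""
    root = ET.fromstring(xml_string)
    return [t.text or "" for t in root.iter("{%s}t" % W_NS)]


texts = run_texts(SAMPLE)
print(len(texts))  # 2
```

If the characters of one cluster show up in more than one list entry, they were stored in separate w:t elements.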
I have a D7 site with CKEditor installed and a Text Format that allows <p> tags and has "convert line breaks into HTML" selected. I'm importing a UTF-8 CSV file made from an Excel spreadsheet that had some cells with several "paragraphs" in them. I guess, for semantics' sake, these are just line breaks. I can see the text broken up into what look like paragraphs in the CSV.
I want this text to be paragraphs, though. When I do the import and look at a node I created, it looks fine, and I can inspect the text and see that <p> tags wrap the paragraphs. But if I go to edit the node, in CKEditor I see that all the paragraph text is now one big paragraph. How can I get all the paragraphs to show?
In the feed importer module, you have the option to change the Filtered HTML text format. It filters all HTML tags inside the content.
How to parse .pdf files in Perl?
Is Perl more efficient, or should I use another language?
I personally use CAM::PDF.
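use CAM::PDF;
my $doc = CAM::PDF->new($fileName) || die "$CAM::PDF::errStr\n";
my $pdfString = $doc->getPageText(1);  # assumption: extract the text of page 1
CAM::PDF->asciify(\$pdfString);        # cleans up some non-ASCII characters in place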
PDFs are not designed for parsing, but for display and printing. Anything you do is therefore trial and error, and parsing may be impossible if everything is stored as graphics. A good indicator is whether you can copy and paste the content from the PDF into an editor. If this works, then you are in business.
Look at CPAN and, specifically, if you want to do OCR, see PDF::OCR2.
I don't know of any module that parses them, that is, if you want to extract the text from them. There are a number of modules that let you manipulate them. Try PDF::API2.
You can set what the Facebook Share preview says. I would like it to be the first paragraph of my Movable Type entry. The people who make entries sometimes use
<p>
tags or they use the rich editor which puts in two
<br /><br />
tags to separate paragraphs.
Is there a way I can have Movable Type detect where the first paragraph ends and display only that paragraph? I would like to add that to my entry template so it will add some information to my head.
EntryBody has a lot of attributes to help format the output of the tag. You can use those to change the content so it shows up correctly in HTML, JavaScript, PHP, XML or other forms of output.
If you understand how to use regular expressions, you can use that and an additional language, say PHP, to break the body up into an array and only output the first paragraph or element of the array.
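The regular-expression approach can be sketched like this. Python stands in here for the "additional language" (the answer mentions PHP as one choice), and `first_paragraph` is a hypothetical helper, not a Movable Type function. It treats either a closing `</p>` or two consecutive `<br />` tags as the end of the first paragraph, matching the two authoring styles described in the question.

```python
import re


def first_paragraph(body):
    """Return the first paragraph of an entry body as plain text."""
    # Split once, on </p> or on two or more consecutive <br> tags.
    parts = re.split(
        r"</p\s*>|(?:<br\s*/?>\s*){2,}",
        body.strip(),
        maxsplit=1,
        flags=re.IGNORECASE,
    )
    first = parts[0]
    # Drop any remaining tags (e.g. the opening <p>) and collapse whitespace.
    first = re.sub(r"<[^>]+>", "", first)
    return " ".join(first.split())


print(first_paragraph("<p>First one.</p><p>Second.</p>"))  # First one.
print(first_paragraph("First one.<br /><br />Second."))    # First one.
```

A regex like this is only adequate for simple, predictable markup; heavily nested or malformed HTML would call for a real HTML parser.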
The simplest thing, though, I would think, would be to do something like
<mt:EntryBody words="100">
That will cut off the entry body after the first 100 words. You could also require users to upload an excerpt with the entry and use the entry excerpt for Facebook, instead.