Get text properties from PDF file - perl

How can I get text properties using PDF::API2 or CAM::PDF? I need font size and style info.
Something like (from CAM::PDF)
$pdf->getPageContent(1);
but with text info in it.

These modules you can acheive the extract text from pdf
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
UPDATE
Read abit more in http://search.cpan.org/dist/CAM-PDF/lib/CAM/PDF.pm
But there are methods like:
$self->getFontNames(pagenum)
And others which may prove helpful.

Related

How extract text geometry using PyPDF2?

I have pdf documents.
And it's clear to me how to extract text from it.
I need to extract not only text but also coordinates associated with this text.
It's my code:
from PyPDF2 import PdfReader
pdf_path = 'docs/doc_3.pdf'
pdf = PdfReader(pdf_path)
page_1_object = pdf.getPage(1)
page_1_object.extractText().split("\n")
The result is:
['Creating value for all stakeholders',
'Anglo\xa0American is re-imagining mining to improve people’s lives.']
I need geometries associated with extracted paragraphs.
Might be something like this for example:
['Creating value for all stakeholders', [1,2,3,4,]]
'Anglo\xa0American is re-imagining mining to improve people’s lives.', [7,8,9,10]]
How I can accomplish it?
Thanks,
Currently that ability is not a PyPDF2 feature, it has the ability for parsing the content as you show extractText() but does not hold the separate glyph xy positions nor output the lines coordinates.
There are other means in python to extract a single or multiple groups of letters that form words.
Using shell commands such as poppler from / in conjunction with a text "word" from PyPDF2 is possible, however the norm would be to run with another Py PDF Lib such as PyMuPDF and here is such an article, https://pyquestions.com/find-text-position-in-pdf-file for highlighting with PyMuPDF input.
The most common means to your goal is probably as described here How to extract text and text coordinates from a PDF file?

Copy formatted text to clipboard

A simple html page has FORMATTED text - not fancy - line breaks, and italic.
I want to have a button that takes this formatted text, and copies it to the clipboard, formatted (it is planned to be pasted into some LibreOffice document later).
Couldn't find how to do it.
I tried ZeroClipboard, and a suggestion to parse the text, replacing ""-s to "\r\n". That indeed does the trick for line breaks, but what about italic?... Any means to get this functionality?...
When you create an italic tag the responsible for formatting the document and showing the text properly is the browser. If you want to copy the text you should get the text already parse and render by the browser or parse yourself the text like you did with break lines. For italics when you find a ... tag you must create the adequate text. That is, text in italics, but that depends on the language you are using, but i'm sure it can be done.
OK,
Turns out that ZeroClipboard had this functionality (of rendering the HTML text upon paste), but have disabled it.
However, the version that supports it can be found at: https://github.com/botcheddevil/ZeroClipboard
Note:
You may find that in this version, creating the client, binding the flash to a component, and handling the events are rather different than the documentation of current version of ZeroClipboard (https://github.com/zeroclipboard/zeroclipboard/blob/master/docs/instructions.md

Formatting PHP code for Epub in MS Word

I'm trying to format the PHP code sections of a 700+ page book for Epub conversion. If I use soft returns at the end of the code lines, they get eaten. If I use hard returns (making each line a paragraph), I either get too much space between the lines, or not enough before and after the code section. If I add an empty line before and after the code section, it gets eaten.
There are thousands of lines of code in the book. Is there some way to handle this without manually editing the html file?
Is there some common format for these code section like being wrapped with PHP tags?
If they are PHP tags you can use this, which will wrap each tag set with a :
function fixPHPcode($matches)
{
return '<p class="php_code">' . $matches[0] . '</p>';
}
$data = preg_replace_callback('/<\?php(.|\s)+?\?>/i', 'fixPHPcode', $data);
I did try some fairly complex regex transforms, but I've found an easier method that actually works fairly well.
The secret is to create a style based on Word's "HTML Preformatted" style, or if you don't have that style, a style based on Normal that specifies Arial Unicode MS or Courier New as the font, with no proofing, left justified.
Indent with spaces and use soft returns (shift-enter) at the end of each line.
Calibre will produce acceptable Epub and Mobi versions of this. Courier is a crap font for code, but at least it's monospaced so the indents will line up, and people are used to seeing it as a code font.

How to parse .pdf files in Perl?

How to parse .pdf files in Perl?
Is perl is more efficient or should I use any other language?
When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file which I parse using XML::Twig (or any other XML parser you like except XML::Simple).
The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page with down and right being positive. Dimensions are in PostScript points (72 points per inch).
I personally use CAM::PDF.
my $doc=CAM::PDF->new($fileName) || die "$CAM::PDF::errStr\n"; CAM::PDF>asciify(/$pdfString);`
Pdfs are not designed for parsing, but for display/printing - thus anything is always try and error and it is quite possible that it is impossible to parse if everything is graphics. A good indicator is if you can copy and paste the content from the pdf into an editor. If this works, then you are in business.
Look at the CPAN and, specifically, if you want to do OCR, see PDF::OCR2
I don't know of any module that parses, that is, if you to extract the text from them. There are a number of modules that let you manipulate them. Try PDF::API2.

How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from command line (i.e using Perl system function) for extracting text from PDF files, this method works fine.
The problem is that we have symbols like α, β and other special characters in the PDF files which are not being displayed in the generated txt file. Also few extra spaces are being added randomly in the text.
Is there a better and more reliable way to extract text from PDF files such that the text will include all the symbols like α, β etc and the text will exactly match the text in the PDF (i.e without extra spaces)?
These modules you can acheive the extract text from pdf
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).
I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.
pdftotext usually recognises non-ASCII characters fine, is it possible it's extracting them ok but the app you're using to view the text file isn't using the correct encoding? If pdftoetxt on windows is the same as the one on my linux system, then it defaults to exporting as utf-8.
There is getpdftext.pl; part of CAM::PDF.
Well, I tried 2-3 perl modules like CAM::PDF, API2 but the problem remains the same! I'm parsing a pdf file containing main pages. Cam or API2 parses the plain text very well. However, they are not able to parse the code snippet [code snippet usually are in different font & encoding than plain text].
James Healy is correct. After trying CAM::PDF and PDF::API2, the former of which I've had some success reading text, downloading pdftotext worked great for a number of my implementations.
If on windows go here and download xpdf precompiled binary:
http://www.foolabs.com/xpdf/download.html
Then, if you need to run this within perl use system, e.g.,:
system("C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe $saveName");
where $saveName is the full path to your PDF file.
This hopefully leaves you with a text file you can open and parse in perl.
i tried this module which is working fine for special characters of pdf..
!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;
my $filename = "pdf.pdf";
my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print "$text";
Take a look at PDFBox. It is a library but i think that it also comes with some tool to do text extracting.