How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. So far I have been calling pdftotext.exe from the command line (i.e. via Perl's system function) to extract the text, and this method works fine.
The problem is that the PDF files contain symbols like α, β and other special characters, which are not displayed in the generated txt file. Also, a few extra spaces are being inserted at random in the text.
Is there a better and more reliable way to extract text from PDF files, such that the text will include all the symbols like α, β etc. and will exactly match the text in the PDF (i.e. without extra spaces)?

You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN
use CAM::PDF;
use CAM::PDF::PageText;

my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);  # content stream of page 1
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.

You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).
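If you are unsure which case you are dealing with, a quick test is to extract each page and see what comes back. Here is a minimal sketch using CAM::PDF (mentioned elsewhere on this page; the file name is an assumption). A page that visibly contains text but extracts as empty has probably been rendered as a bitmap and would need OCR.
use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('input.pdf') or die "$CAM::PDF::errstr\n";
for my $page (1 .. $pdf->numPages()) {
    my $text = $pdf->getPageText($page);
    # A page with visible text but no extractable text is likely a scanned image.
    printf "page %d: %s\n", $page,
        (defined $text && $text =~ /\S/) ? 'has text' : 'no extractable text';
}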

I'm not a Perl user, but I imagine you'll struggle to find a better free text extractor than pdftotext.
pdftotext usually recognises non-ASCII characters fine. Is it possible that it's extracting them OK, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, it defaults to exporting UTF-8.
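If you suspect the viewer, read the file back with an explicit UTF-8 layer and re-emit it; if α and β show up here, pdftotext did its job. A minimal sketch (the file name is an assumption):
use strict;
use warnings;

open my $fh, '<:encoding(UTF-8)', 'output.txt' or die "Cannot open: $!";
binmode STDOUT, ':encoding(UTF-8)';   # make sure the terminal side is UTF-8 too
print while <$fh>;
close $fh;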

There is getpdftext.pl, part of CAM::PDF.
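It is installed as a command-line script with the distribution. Assuming it is on your PATH (the file name here is just an example), you can capture its output from Perl:
my $text = qx{getpdftext.pl input.pdf};
die "getpdftext.pl failed: $?\n" if $?;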

Well, I tried 2-3 Perl modules, like CAM::PDF and PDF::API2, but the problem remains the same! I'm parsing a PDF file containing many pages. CAM::PDF and PDF::API2 parse the plain text very well, however they are not able to parse the code snippets (code snippets are usually in a different font and encoding than plain text).

James Healy is correct. After trying CAM::PDF and PDF::API2 (with the former of which I've had some success reading text), downloading pdftotext worked great for a number of my implementations.
If you are on Windows, go here and download the xpdf precompiled binaries:
http://www.foolabs.com/xpdf/download.html
Then, if you need to run this within Perl, use system, e.g.:
system("C:\\Utilities\\xpdfbin-win-3.04\\bin64\\pdftotext.exe \"$saveName\"");
where $saveName is the full path to your PDF file. (Note the doubled backslashes: in the original single-backslash version, Perl would interpret sequences like \U as string escapes.)
This hopefully leaves you with a text file you can open and parse in Perl.

I tried this module, which works fine for the special characters in PDFs:
#!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;

my $filename = "pdf.pdf";
my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print $text;

Take a look at PDFBox. It is a Java library, but it also comes with command-line tools, including one for text extraction.
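For example, something like this minimal sketch should work from Perl; the jar name and version are assumptions, and Java must be installed:
my $jar = 'pdfbox-app-2.0.27.jar';   # hypothetical path and version
system('java', '-jar', $jar, 'ExtractText', 'input.pdf', 'output.txt') == 0
    or die "PDFBox ExtractText failed: $?";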

Related

What are the characters shown on a file after forcefully changing the extension?

Recently I changed the extension of an .apk file to .txt and despite this, I was able to open it on Notepad with some random characters, that weren't available on the keyboard in the file. org/antlr/runtime/ANTLRFileStream.class…TmOÓP=w[×QËÀ)ê|A…ÑETÔ¢NP¢™ãË—º•Q3ZÓcüþ¿j",£ß4ñGÏmÇñ˽Ïs{žçœçeûùëóW ±¨á0F5d0ÖA˔‹LÈã’ŠËR˜PqEƒ†Iy\•ØkÒºÞÁЂ´¦TL«˜H­95{ÙÚ°2K/­×–Y³Üªù(ð·:%œv\'¸!Гû÷óðª#¢èUܵä¸öòæÆÛ_±^ÔÂt^Ùª­Z¾#ýæc"XwêKž_5-7¨ù¦¿éΆmÞZ^Y*ÍS “ÛÖ¹µ¹7eûUàxn]%µ‘Ð^TÊvË^…kžUˆ;u_àTw<sÁ}µDL%ÛªØ>ùÄš#º…Rø˜¨;o)\,0ǚԞ݇ؓ‡àΪ<ò6ýr³¥GsÃ횪EOÌ_…É =è•Ç¬Ž#8ª£½ú^fùõ˜Ž›¸%pü IT{`Á2þ¶<Š:î`NÇ<î긇A˜èÿïˆ8Ç0Q¥»¨#- Ze7srRÉšíVƒõÐ]0rí&tÀ”O´‡[Y±K ö¬H›¯Ü %÷¬8Ì) r+åšW·ÑÏF†¿,bd—i%h³­ˆá8½YÄiª‘
Not just this, but converting many other extensions like .jar, .xapk, etc. would show me these characters. Can anyone please explain what factors these characters are based on, and how the OS decides or tries to determine what characters to show for an unsupported file?
Is there a way to get the original content through this data?
Let's say you created a text editor which can write and save text files as well as open them. You also defined the encoding that will be used to save text in binary files (all files, when saved, are binary). Suppose your encoding looks something like this:
Your encoding          Emacs encoding
TEXT  BINARY           TEXT  BINARY
A     01000001         ă     01000001
B     01000010         Ћ     01000010
...                    ...
Z     01011010         Ϡ     01011010
Let's say you create a file with 'ABZ' as its contents. When saved, this file contains the value 010000010100001001011010. When you open this file with your text editor, the editor finds 010000010100001001011010 as the file contents and, using the encoding above, knows that it is 'ABZ', hence it prints 'ABZ' on the screen.
Now let's say you open the same file using Emacs. Since Emacs uses its own encoding, it displays "ăЋϠ". There is nothing wrong with Emacs; it just doesn't know that the data was written using your custom encoding.
So the point is that every file is written in a specific format; for example, the APK format can only be correctly understood by the Android system. When you try to open an APK file in a text editor, it just tries to make sense of the binary data in the same way Emacs does in the example above.
Is there a way to get the original content through this data?
If you know the original encoding with which the data was written, then you can read the contents of the file using that same encoding.
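In Perl, that means opening the file with the matching :encoding() layer. A minimal sketch, assuming (purely for illustration) the data was written as Shift-JIS:
use strict;
use warnings;

open my $fh, '<:encoding(shift_jis)', 'data.txt' or die "Cannot open: $!";
binmode STDOUT, ':encoding(UTF-8)';   # re-emit the decoded text as UTF-8
print while <$fh>;
close $fh;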

Get text properties from PDF file

How can I get text properties using PDF::API2 or CAM::PDF? I need font size and style info.
Something like (from CAM::PDF)
$pdf->getPageContent(1);
but with text info in it.
You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN
use CAM::PDF;
use CAM::PDF::PageText;

my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);  # content stream of page 1
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
UPDATE
Read a bit more at http://search.cpan.org/dist/CAM-PDF/lib/CAM/PDF.pm
But there are methods like:
$self->getFontNames(pagenum)
And others which may prove helpful.
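For instance, here is a minimal sketch that lists the font names referenced on each page (the file name is an assumption); getting per-run font size and style would mean walking the content tree from getPageContentTree yourself, which CAM::PDF leaves to you:
use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('input.pdf') or die "$CAM::PDF::errstr\n";
for my $page (1 .. $pdf->numPages()) {
    my @fonts = $pdf->getFontNames($page);   # font names referenced on this page
    print "page $page: @fonts\n";
}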

Fetching Asian (Japanese / Chinese) characters from Excel file into TSV format using Perl's Spreadsheet::ParseExcel

Friends,
I am preparing a TSV file from an Excel file containing Chinese (special) characters, as follows - The Seonjeongneung ... Jeonghyeon (貞顯王后, 1462–1530) .....
I have tried using Perl CPAN's Spreadsheet::ParseExcel and Spreadsheet::ParseExcel::FmtJapan, but with no success. These characters appear as ?? in the TSV file when it is opened in VIM.
I also tried " binmode STDOUT, ':utf8'; " and " binmode STDOUT, ':encoding(cp932)'; "
Please help me find a way to extract the information from the Excel sheets and get it into TSV format.
PS: Excel allows a direct save as TSV, but the output was screwed up there as well.
I just exported your sample text perfectly from OpenOffice Calc, simply by choosing the "Save as .csv" option and choosing UTF-8 as the format. I'd be very surprised if Excel can't do the same. Have you considered the possibility that VIM / your console doesn't support Chinese characters correctly, or that it's set to use a font that doesn't include them? To check for this kind of error, open your .csv or .tsv file in your web browser; browsers will do anything to correctly display a file, including changing fonts as necessary.
If you want, send me the file you need to export and I'll check if there's anything weird about it. It could be in one of the native Chinese encodings (GB or Big5) instead of Unicode.
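If the export itself is the problem, here is a minimal Spreadsheet::ParseExcel sketch that writes a UTF-8 TSV (the file names are assumptions); the important part is the :encoding(UTF-8) layer on the output handle, since cell values come back as Perl character strings:
use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse('input.xls') or die $parser->error();

open my $out, '>:encoding(UTF-8)', 'output.tsv' or die "Cannot write: $!";
for my $sheet ($workbook->worksheets()) {
    my ($row_min, $row_max) = $sheet->row_range();
    my ($col_min, $col_max) = $sheet->col_range();
    for my $row ($row_min .. $row_max) {
        my @cells;
        for my $col ($col_min .. $col_max) {
            my $cell = $sheet->get_cell($row, $col);
            push @cells, defined $cell ? $cell->value() : '';
        }
        print {$out} join("\t", @cells), "\n";
    }
}
close $out;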

Why am I unable to parse non-proportional text using CAM::PDF?

While parsing page no. 22 of http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vxfs_admin.pdf, I am able to parse all the words except mount_vxfs, as its encoding style and/or font is different from the normal plain text.
Please find attached PDF Page for details.
Please find my code:
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

my $file_name = "vxfs_admin_51sp1_lin.pdf";
my $pdf = CAM::PDF->new($file_name);
my $no_pages = $pdf->numPages();
print "$no_pages\n";
for my $i (1 .. $no_pages) {
    my $page = $pdf->getPageText($i);
    # for page no. 22 only:
    # if ($i == 22) {
    print $page;
    # }
}
PDF doesn't store the semantic text that you read but rather uses character codes which map to glyphs (the painted characters) in a particular font. Often, however, the code-glyph mapping matches common character sets (such as ISO-8859-1 or UTF-8) so that the codes are human-readable. That's the case for all of the text you have been able to parse, although sometimes the odd character, mostly punctuation, is also "wrong".
The text for "mount_vxfs" in your document is encoded completely differently, unfortunately, resulting in apparent garbage. If you're curious, you can see what's really there by substituting getPageText() with getPageContent() in your code.
In order to convert the PDF text back to meaningful characters, PDF readers have to jump through hoops with a number of conversion tables (including the so-called CMaps). Because this is a lot of programming work, many simpler libraries opt not to implement them. That's the case with CAM::PDF.
If you're just interested in parsing the text (not editing it), the following technique is something I use with success:
Obtain xpdf (http://foolabs.com/xpdf) or Poppler (http://poppler.freedesktop.org/). Poppler is a newer fork of xpdf. If you're using *nix, there will be a package available.
Use the command-line tool 'pdftotext' to extract the text from a file, either page-wise or all at once.
Example:
#!/usr/bin/perl
use strict;
use warnings;
use English;

my $file_name = "vxfs_admin.pdf";
open my $text_fh, "/usr/bin/pdftotext -layout -q '$file_name' - 2>/dev/null |"
    or die "Cannot run pdftotext: $!";
local $INPUT_RECORD_SEPARATOR = "\f";   # slurp a whole page at a time
while (my $page_text = <$text_fh>) {
    # this is here only for demo purposes
    print $page_text if $INPUT_LINE_NUMBER == 19;
}
close $text_fh;
(Note: The document I retrieved using your link is slightly different; the offending bit is on page 19 instead.)

How to parse .pdf files in Perl?

How to parse .pdf files in Perl?
Is Perl efficient for this, or should I use another language?
When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file which I parse using XML::Twig (or any other XML parser you like except XML::Simple).
The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page with down and right being positive. Dimensions are in PostScript points (72 points per inch).
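Here is a minimal sketch of that workflow (the file names are assumptions, and pdftohtml must be on your PATH): convert the PDF to XML, then walk the <page> elements and sort each page's <text> lines by their top and left attributes before printing.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $pdf = 'input.pdf';
# -xml writes XML instead of HTML; -i ignores images; output goes to out.xml
system('pdftohtml', '-xml', '-i', $pdf, 'out') == 0
    or die "pdftohtml failed: $?";

my $twig = XML::Twig->new();
$twig->parsefile('out.xml');
for my $page ($twig->root->children('page')) {
    # <text> elements are not necessarily in reading order,
    # so sort top-to-bottom, then left-to-right.
    my @lines = sort {
        $a->att('top') <=> $b->att('top')
            || $a->att('left') <=> $b->att('left')
    } $page->children('text');
    print $_->text, "\n" for @lines;
}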
I personally use CAM::PDF.
my $doc = CAM::PDF->new($fileName) || die "$CAM::PDF::errstr\n";
CAM::PDF->asciify(\$pdfString);
PDFs are not designed for parsing but for display and printing, so anything is always trial and error, and it is quite possible that parsing is impossible if everything is graphics. A good indicator is whether you can copy and paste the content from the PDF into an editor; if this works, then you are in business.
Look at CPAN and, specifically, if you want to do OCR, see PDF::OCR2.
I don't know of any module that parses, that is, extracts the text from them. There are a number of modules that let you manipulate them; try PDF::API2.
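In that manipulation spirit, here is a minimal PDF::API2 sketch (the file names are assumptions) that opens a document, reports its page count, and writes a copy:
use strict;
use warnings;
use PDF::API2;

my $pdf = PDF::API2->open('input.pdf');
print "pages: ", $pdf->pages, "\n";   # number of pages in the document
$pdf->saveas('copy.pdf');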