How to use PDFTextExtractor on iTextSharp - itext

I want to retrieve the text from a PDF file using iTextSharp. However, I wasn't able to use PdfTextExtractor the way it is used in the Java library (iText). I need a readPDFOffline method that returns the content of the file. The pseudocode below shows what I want.
private string readPDFOffline(string fileUri);
read PDF;
retrieve Text Content of This Pdf;*
save content into string contentOfflineFile;
return contentOfflineFile;
The step marked * is the part I need help with.

PdfTextExtractor is present in the most recent releases of iTextSharp, available here.
Retrieving text in PDF is not easy. Not impossible, but there are times when the only thing that will work is OCR. For all other cases, PdfTextExtractor should work. Cases of it not working are considered bugs and should be reported as such.
Be aware that there are several cases where what looks like valid text is not extractable:
Text with no encoding... just glyph indexes. OCR time.
"Text" that is just raw paths. Horribly inefficient, and time for more OCR.
"Text" that is pixels in a bitmap. OCR once more.
OCR: Optical Character Recognition. There's even a reasonably good one for free available on Google Code, though I don't recall the name off the top of my head.

Related

keep/copy XMP with libexif

I am trying to add a thumbnail to a JPEG picture using libexif.
For now I'm borrowing the code from exif (the command-line tool shipped by the libexif team).
However, I noticed that the XMP tags get deleted from the metadata. There is an old bug report here.
I tried to see how to achieve this anyway with libexif, but I don't really understand how to get the XMP from the input file and put it in the output file. I just want to copy all XMP data; I don't need to extract anything from it.
I saw there is a tag EXIF_TAG_XML_PACKET in exif_tag.h but couldn't figure out how to read/write this tag.
A related solution is in this SO answer, but it looks complicated. I'm not familiar with coding in C.
Is it actually possible to keep all XMP when using only the libexif API? Have things changed in recent years? How would you write this in code?
Thanks
I believe it should be somewhat straightforward. XMP fields are described in the ISO/Adobe standard. Regular Kotlin/Java/Android file I/O and some string manipulation should be all that is required.
I would start out by becoming intimately familiar with ISO 16684-1:2019. Then, write a method for your JPEG file class that grabs all the XMP fields. Store those fields in a temp file (to prevent difficult-to-recover data loss in the event of your code or libexif crashing). Hand the file off to libexif. Generate the thumbnail. Finally, when that's done, you can restore the XMP fields. If the thumbnail is stored in an XMP field as well (and it sounds like it is), it may be easier to concatenate that field with the ones that were already grabbed, updating the temp file so that it contains EVERY XMP field, before adding all of the XMP fields back to the JPEG.
Unfortunately, I do not currently have the time to read a 50-page ISO standard, synthesize the information, and then write the code to implement the solution. Here's a link to the standard, at least, to get you started.
https://www.iso.org/obp/ui/#iso:std:iso:16684:-1:ed-2:v1:en
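As a starting point for the "grab all the XMP" step, here is a minimal sketch in Python (for illustration only; it does not use libexif). It relies on one fact about the file layout: in a JPEG, the XMP packet lives in an APP1 segment whose payload starts with the string http://ns.adobe.com/xap/1.0/ followed by a NUL byte. The function names and the synthetic test file are my own assumptions, and padding bytes or XMP split across several segments are not handled.

```python
import struct
from typing import List, Tuple

XMP_HEADER = b"http://ns.adobe.com/xap/1.0/\x00"


def jpeg_segments(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (marker, payload) pairs for the segments before the image data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file (missing SOI marker)")
    segments = []
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((marker, data[pos + 4:pos + 2 + length]))
        pos += 2 + length
    return segments


def extract_xmp(data: bytes) -> List[bytes]:
    """Return the raw XMP packets found in APP1 (0xE1) segments."""
    return [payload[len(XMP_HEADER):]
            for marker, payload in jpeg_segments(data)
            if marker == 0xE1 and payload.startswith(XMP_HEADER)]


def _segment(marker: int, payload: bytes) -> bytes:
    """Build one segment; only used to fabricate the test file below."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


# Synthetic file: SOI, a dummy Exif APP1, an XMP APP1, then start of scan.
packet = b"<x:xmpmeta xmlns:x='adobe:ns:meta/'/>"
fake_jpeg = (b"\xff\xd8"
             + _segment(0xE1, b"Exif\x00\x00" + b"\x00" * 8)
             + _segment(0xE1, XMP_HEADER + packet)
             + b"\xff\xda")
print(extract_xmp(fake_jpeg))
```

Copying the XMP into the output file is the reverse operation: wrap the saved packet in a new APP1 segment with the same header and insert it among the other APPn segments of the file that libexif wrote.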

PDF Parsing with SWIFT

I want to parse a PDF that has no images, only text. I'm trying to find pieces of text. For example to search the string "Name:" and be able to read the characters after ":".
I'm already able to open a PDF, get the number of pages, and to loop on them. The problem is when I want to use functions like CGPDFDictionaryGetStream or CGPDFStreamCopyData, because they use pointers. I have not found much info on the internet for swift programmers.
Maybe the easiest way would be to parse all the content to an NSString. Then I could do the rest.
Here is my code:
// Get existing Pdf reference
let pdf = CGPDFDocumentCreateWithURL(NSURL(fileURLWithPath: path))
let pageCount = CGPDFDocumentGetNumberOfPages(pdf)
for index in 1...pageCount {
    let myPage = CGPDFDocumentGetPage(pdf, index)
    // Search somehow the string "Name:" to get what's written next
}
You can use PDFKit to do this. It is part of the Quartz framework on macOS and is also available on iOS (iOS 11 and later, via import PDFKit). It is also pretty fast: I was able to search through a PDF with over 15000 characters in just 0.07s.
Here is an example:
import Quartz
let pdf = PDFDocument(url: URL(fileURLWithPath: "/Users/...some path.../test.pdf"))
guard let contents = pdf?.string else {
    print("could not get string from pdf: \(String(describing: pdf))")
    exit(1)
}
let footNote = contents.components(separatedBy: "FOOT NOTE: ")[1] // get all the text after the first foot note
print(footNote.components(separatedBy: "\n")[0]) // print the first line of that text
// Output: "The operating system being written in C resulted in a more portable software."
You can also still access most (if not all) of the properties you had before, such as pdf.pageCount for the number of pages and pdf.page(at: <Int>) to get a specific page.
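For the original goal (reading what follows "Name:"), the same string handling can be written so that a missing label does not crash. A minimal sketch in Python, for illustration only: text_after is a hypothetical helper, and the sample string stands in for the text returned by pdf.string.

```python
from typing import Optional


def text_after(contents: str, label: str) -> Optional[str]:
    """Return the rest of the line that follows the first `label`, or None."""
    start = contents.find(label)
    if start == -1:
        return None  # label not present; the caller decides what to do
    rest = contents[start + len(label):]
    # Keep only the remainder of that line and trim surrounding whitespace.
    return rest.split("\n", 1)[0].strip()


# Made-up sample text standing in for the extracted PDF contents.
sample = "Invoice 17\nName: Ada Lovelace\nCity: London\n"
print(text_after(sample, "Name:"))
print(text_after(sample, "Phone:"))
```

Returning None for a missing label avoids the out-of-range index that components(separatedBy:)[1] would hit when the marker is absent.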
This is a pretty intensive task. There are libs like PDFKitten, which are not maintained anymore. Here is a port of PDFKitten to Swift that I did, with some modifications to the way the string searching / content indexing is done, as well as support for TrueType fonts.
https://github.com/SimpleApp/PDFParser
[disclaimer: lib author]
[second disclaimer: this lib is 100% MIT open source. The library has nothing to do with the company, it's not an ad or even a product. I'm posting this comment to help people, and then maybe grow a community around it, because it's a very common requirement and nothing free works well enough]
EDIT: the reason it's a pretty intensive task (not to mention all the character encoding issues) is that the PDF format doesn't have the notion of a "line of text" or even a "word". All it has is character printing instructions. This means that if you want to find a "word", you have to recompute the frame of every block of characters, using font information, and find the ones that can be coalesced into a single word.
That's the reason why you won't find a lot of libraries offering that kind of feature, and why even some big projects sometimes fail to provide correct copy/paste or text search.
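To make the coalescing step concrete, here is a minimal sketch in Python (illustration only, not the code of the library above). It assumes the glyph boxes for one baseline have already been computed from the font metrics; the Glyph type, the gap_ratio threshold, and the sample coordinates are all made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph box
    width: float  # advance width, already scaled by the font size


def coalesce_words(glyphs: List[Glyph], gap_ratio: float = 0.3) -> List[str]:
    """Group glyphs on one baseline into words by the gaps between boxes."""
    words: List[str] = []
    current = ""
    prev_end = None
    for g in sorted(glyphs, key=lambda g: g.x):
        # A gap wider than a fraction of the glyph width starts a new word.
        if prev_end is not None and g.x - prev_end > gap_ratio * g.width:
            words.append(current)
            current = ""
        current += g.char
        prev_end = g.x + g.width
    if current:
        words.append(current)
    return words


# Seven glyphs with a visible gap between "Name" and "Ada".
glyphs = [Glyph("N", 0, 7), Glyph("a", 7, 5), Glyph("m", 12, 8), Glyph("e", 20, 5),
          Glyph("A", 30, 7), Glyph("d", 37, 5), Glyph("a", 42, 5)]
print(coalesce_words(glyphs))
```

A real parser also has to group glyphs into lines first, handle rotated text, and cope with fonts that report unreliable widths, which is where most of the effort goes.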

Unicode and PDF tooltips / messageboxes with iTextSharp

I need to find a way to add a quite long string in a quite small space in a PDF document.
I am using iTextSharp. I have already tried adding comment annotations (balloons) with PdfAnnotation.CreateText() and I didn't like the way they looked/worked. It made the page too heavy (I had many comments per page) and their behavior was odd in many ways (thank Adobe for that).
Now I was thinking of adding some simple tooltips on 'chunks' in the page or popping-up messageboxes with javascript (like illustrated here : http://www.codehacker.com/ITEXTSHARP/chap15.aspx#). To my great disappointment however, it seems that Acrobat (?) doesn't support Unicode characters in those situations. E.g. I do this:
var javascript = new PdfAnnotation(
    w, 200f, 550f, 300f, 650f,
    PdfAction.JavaScript("app.alert('" + "Αρνάκι άσπρο και παχύ!" + "');\r", w));
chunk.SetAnnotation(javascript);
...and, in the best case, a messagebox with gibberish pops up when the user clicks on the chunk.
Is there any setting for making Unicode acceptable for the code above or another way to do what I want?
EDIT:
I have now seen this: https://stackoverflow.com/a/163065/964053
and I've tried modifying my code like that:
var javascript = new PdfAnnotation(
    w, 200f, 550f, 300f, 650f,
    PdfAction.JavaScript((char)0xFEFF + "app.alert('" + "Αρνάκι άσπρο και παχύ!" + "');\r", w));
chunk.SetAnnotation(javascript);
But nothing seems to change...
EDIT2 :
Using octal representation e.g. (\141) doesn't seem to help either...
EDIT3 :
This seems to work nicely until you double-click on it, but I need the tooltip to size itself based on the size of its contents:
var lToolTip = PdfFormField.CreatePopup(
w, new Rectangle(tc.Left, tc.Bottom, tc.Right, tc.Top), val, true);
chunk.SetAnnotation(lToolTip);
The rectangle provided doesn't seem to be used in any way...
Any ideas?
I don't know what PdfFormField.CreatePopup() is supposed to create, but I see a small mark on my page that displays a popup when you hover the mouse over it.
I'm kind of lost in your edits, it's not clear what works for you and what doesn't, but regarding the unicode problem in JavaScript: are you aware that there are two versions of the javaScript() method?
See javaScript(java.lang.String, com.itextpdf.text.pdf.PdfWriter, boolean); in iTextSharp the corresponding overload should be PdfAction.JavaScript(string, PdfWriter, bool).
If you add the boolean value true, the JavaScript string should be interpreted as Unicode. If this doesn't solve your problem, I'll delete this answer, and if you clarify your question (cutting away the irrelevant parts), I'll do another attempt.
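A different workaround, independent of the Unicode flag above, is to keep the script itself pure ASCII by writing every non-ASCII character as a JavaScript \uXXXX escape. This is a sketch of the escaping step only, in Python for illustration; js_string_literal is a hypothetical helper, and I have not verified how a given Acrobat version displays the result.

```python
def js_string_literal(text: str) -> str:
    """Return `text` as a single-quoted JavaScript literal using only ASCII."""
    out = []
    for ch in text:
        if ch in ("\\", "'"):
            out.append("\\" + ch)  # escape backslashes and quotes
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif 0x20 <= ord(ch) < 0x7F:
            out.append(ch)  # printable ASCII passes through unchanged
        else:
            # Emit UTF-16 code units so characters outside the BMP work too.
            data = ch.encode("utf-16-be")
            for i in range(0, len(data), 2):
                out.append("\\u%04X" % int.from_bytes(data[i:i + 2], "big"))
    return "'" + "".join(out) + "'"


script = "app.alert(" + js_string_literal("Αρνάκι άσπρο και παχύ!") + ");"
print(script)
```

Building the literal this way also escapes apostrophes in the message, which the string concatenation in the question would otherwise turn into a syntax error.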

How do I use IPTC/EXIF metadata to categorise photos?

Many photo viewing and editing applications allow you to examine and change EXIF and IPTC data in JPEG and other image files. For example, I can see things like shutter speed, aperture and orientation in the picture files that come off my Canon A430. There are many, many name/value pairs in all this metadata. But...
What do I do if I want to store some data that doesn't have a built-in field name? Let's say I'm photographing an athletics competition and I want to tag every photo with the competitor's bib number. Can I create a "bib_number" field and assign it values such as "0001", "5478", "8124" etc., and then search for all photos with bib_number="5478"?
I've spent a few hours searching and the best I can come up with is to put this custom information in the "keywords" field, but this isn't quite what I'm after. With this solution I'd have to craft a query like "keywords contains bib_number_5478", whereas what I want is "bib_number is 5478".
So do the EXIF and/or IPTC standards allow additional user-defined field names?
Thanks
Kev
It can be used for that, but it really shouldn't: it's meant to be user-editable and so isn't a safe place to put critical metadata. Using an XMP sidecar is better for this kind of thing: in XMP, any field added that a given app does not understand is, according to the standard, supposed to be ignored by that app and not destroyed.
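To show what a custom field in an XMP sidecar could look like, here is a minimal sketch in Python using only the standard library. The x:xmpmeta and rdf:RDF wrappers are the standard XMP packaging; the namespace URI http://example.com/ns/race/1.0/, the race prefix, and the bibNumber property are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

NS_X = "adobe:ns:meta/"
NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS_RACE = "http://example.com/ns/race/1.0/"  # hypothetical custom namespace

ET.register_namespace("x", NS_X)
ET.register_namespace("rdf", NS_RDF)
ET.register_namespace("race", NS_RACE)


def build_sidecar(bib_number: str) -> str:
    """Return the text of an .xmp sidecar holding one custom property."""
    xmpmeta = ET.Element(f"{{{NS_X}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS_RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{NS_RDF}}}Description",
                         {f"{{{NS_RDF}}}about": ""})
    ET.SubElement(desc, f"{{{NS_RACE}}}bibNumber").text = bib_number
    return ET.tostring(xmpmeta, encoding="unicode")


def read_bib_number(sidecar_xml: str) -> Optional[str]:
    """Return the custom bib number, or None if the sidecar has none."""
    node = ET.fromstring(sidecar_xml).find(f".//{{{NS_RACE}}}bibNumber")
    return None if node is None else node.text


# Hypothetical file names mapped to their sidecar contents.
sidecars = {"IMG_0001.xmp": build_sidecar("5478"),
            "IMG_0002.xmp": build_sidecar("8124")}
print([name for name, xml in sidecars.items() if read_bib_number(xml) == "5478"])
```

With the value in its own property, the query becomes "bibNumber is 5478" instead of a substring match on keywords. Whether a particular photo manager can search such a field depends on the application.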
I don't know if there are applications to do this but by the standards described for JPEG files there is a field called Comments where you can assign values that could act like tags.
C# code:
using System.Windows.Media.Imaging;
using System.IO;
...
FileStream fs = new FileStream(@"<img_path>", FileMode.Open, FileAccess.ReadWrite);
BitmapMetadata bmd = (BitmapMetadata)BitmapFrame.Create(fs).Metadata;
bmd.Comment = "Some Comment Here";
Also, if you are looking for an application that already has this functionality built in, I can recommend IrfanView (open the picture, go to the Image menu, click the Comments button).
Hope this helps.

storing & reading xml files iPhone

I am using a text editor to store XML files.
I know how to read XML files in an iPhone application, but I have run into the problem described below.
When I store XML files through the text editor, they look perfect.
But when I debug the iPhone app in Xcode, the XML file data is shown as below.
What mistake have I made?
{\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf460
{\fonttbl\f0\fmodern\fcharset0 Courier-Bold;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww9000\viewh8400\viewkind0
\pard\tx480\tx960\tx1440\tx1920\tx2400\tx2880\tx3360\tx3840\tx4320\tx4800\tx5280\tx5760\tx6240\tx6720\tx7200\tx7680\tx8160\tx8640\tx9120\tx9600\tx10080\tx10560\tx11040\tx11520\tx12000\tx12480\tx12960\tx13440\tx13920\tx14400\tx14880\tx15360\tx15840\tx16320\tx16800\tx17280\tx17760\tx18240\tx18720\tx19200\tx19680\tx20160\tx20640\tx21120\tx21600\tx22080\tx22560\tx23040\tx23520\tx24000\tx24480\tx24960\tx25440\tx25920\tx26400\tx26880\tx27360\tx27840\tx28320\tx28800\tx29280\tx29760\tx30240\tx30720\tx31200\tx31680\tx32160\tx32640\tx33120\tx33600\tx34080\tx34560\tx35040\tx35520\tx36000\tx36480\tx36960\tx37440\tx37920\tx38400\tx38880\tx39360\tx39840\tx40320\tx40800\tx41280\tx41760\tx42240\tx42720\tx43200\tx43680\tx44160\tx44640\tx45120\tx45600\tx46080\tx46560\tx47040\tx47520\tx48000\ql\qnatural\pardirnatural
\f0\b\fs24 \cf0 \CocoaLigature0 \
Play Your ShotEvery golfer likes to hit the ball long, high and straight, but experience tells you that you can expect a certain pattern to your shots.Shots Curve From Sidehill LiesSidehill fairway lies will cause the ball to curve, slice right or hook left. Here's a tip to help make better contact.Don't Ground The DriverKeep the clubhead just off the ground to get your swing off to a consistently smooth start.Tilt Your TeeHere's a tip to "max out" into-the-wind drives.}
The data shown above isn't my XML file.
I tried to save my XML file through the text editor.
But it prefixes something to my XML data.
What should I do to avoid this problem?
My actual XML data is the following.
<?xml version="1.0" encoding="ISO-8859-1"?>\
<tips><Prop_Tips><Tip_ID><![CDATA[1]]></Tip_ID><Tip_Title>Play Your Shot</Tip_Title><Tip_Description>Every golfer likes to hit the ball long, high and straight, but experience tells you that you can expect a certain pattern to your shots.</Tip_Description></Prop_Tips><Prop_Tips><Tip_ID><![CDATA[2]]></Tip_ID><Tip_Title>Shots Curve From Sidehill Lies</Tip_Title><Tip_Description>Sidehill fairway lies will cause the ball to curve, slice right or hook left. Here's a tip to help make better contact.</Tip_Description></Prop_Tips><Prop_Tips><Tip_ID><![CDATA[3]]></Tip_ID><Tip_Title>Don't Ground The Driver</Tip_Title><Tip_Description>Keep the clubhead just off the ground to get your swing off to a consistently smooth start.</Tip_Description></Prop_Tips><Prop_Tips><Tip_ID><![CDATA[4]]></Tip_ID><Tip_Title>Tilt Your Tee</Tip_Title><Tip_Description>Here's a tip to "max out" into-the-wind drives.</Tip_Description></Prop_Tips></tips>
It seems you have written your XML file in a rich text editor. It is saving the file in .rtf format instead of as plain text/XML.
Maybe the editor you are using has an option to save as plain text. That should solve it.
You could also create the file using Xcode. It will surely use plain text.
Are you editing your XML file with TextEdit or something? You've got a whole bunch of RTF data in there, which would seem to imply that you're overwriting your XML data with RTF'd XML. Try using a text editor like Property List Editor or TextWrangler instead.
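A quick way to confirm this diagnosis is to look at the first bytes of the file: an RTF document starts with {\rtf, while an XML file starts with <. A minimal sketch in Python, for illustration only; sniff_format is a hypothetical helper name.

```python
def sniff_format(data: bytes) -> str:
    """Classify file contents as 'rtf', 'xml', or 'unknown' from the first bytes."""
    # Skip an optional UTF-8 byte order mark and leading whitespace.
    head = data.lstrip(b"\xef\xbb\xbf \t\r\n")
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(b"<"):
        return "xml"
    return "unknown"


# First bytes of the two files shown in the question.
print(sniff_format(rb"{\rtf1\ansi\ansicpg1252\cocoartf949"))
print(sniff_format(b'<?xml version="1.0" encoding="ISO-8859-1"?><tips/>'))
```

If the check reports rtf, re-save the file as plain text before bundling it with the app.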