How to recognize a PDF watermark and remove it using PDFBox - metadata

I'm trying to extract text, except watermark text, from PDF files with the Apache PDFBox library, so I want to remove the watermark first; the rest is what I want. Unfortunately, neither PDMetadata nor PDXObject can recognize the watermark. Any help will be appreciated. I found some code below.
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

// Open PDF document
PDDocument document = null;
try {
    document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
} catch (IOException e) {
    e.printStackTrace();
}

// Get all pages and loop through them
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while (iter.hasNext()) {
    PDPage page = (PDPage) iter.next();
    PDResources resources = page.getResources();
    Map images = null;
    // Get all images on the page
    try {
        images = resources.getImages(); // How to specify the watermark instead of images??
    } catch (IOException e) {
        e.printStackTrace();
    }
    if (images != null) {
        // Check all images for metadata
        Iterator imageIter = images.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage image = (PDXObjectImage) images.get(key);
            PDMetadata metadata = image.getMetadata();
            System.out.println("Found an image: Analyzing for metadata");
            if (metadata == null) {
                System.out.println("No metadata found for this image.");
            } else {
                InputStream xmlInputStream = null;
                try {
                    xmlInputStream = metadata.createInputStream();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                try {
                    System.out.println("--------------------------------------------------------------------------------");
                    String mystring = convertStreamToString(xmlInputStream);
                    System.out.println(mystring);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            // Export the image
            String name = getUniqueFileName(key, image.getSuffix());
            System.out.println("Writing image: " + name);
            try {
                image.write2file(name);
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println("--------------------------------------------------------------------------------");
        }
    }
}

In contrast to your assumption, there is no explicit watermark object in a PDF by which one could recognize watermarks in generic PDFs.
Watermarks can be applied to a PDF page in many ways; each PDF creating library or application has its own way to add watermarks, some even offer multiple ways.
Watermarks can be
anything (Bitmap graphics, vector graphics, text, ...) drawn early in the content and, therefore, forming a background on which the rest of the content is drawn;
anything (Bitmap graphics, vector graphics, text, ...) drawn late in the content with transparency, forming a transparent overlay;
anything (Bitmap graphics, vector graphics, text, ...) drawn in the content stream of a watermark annotation which shall be used to represent graphics that shall be printed at a fixed size and position on a page, regardless of the dimensions of the printed page (cf. section 12.5.6.22 of the PDF specification ISO 32000-1).
Sometimes even mixed forms are used; have a look at this answer for an example, at the bottom of which you find a 'watermark' drawn above graphics but beneath text (to allow for easy reading).
The latter choice (the watermark annotation) is obviously easy to remove, but it is also the least often used choice, most likely because it is so easy to remove; people applying watermarks generally don't want their watermarks to get lost. Furthermore, annotations are sometimes handled incorrectly by PDF viewers, and code copying page content often ignores annotations.
If, on the other hand, you do not handle generic documents but a specific type of document (all generated alike), the manner in which the watermarks are applied can probably be recognized, and an extraction routine might be feasible. If you have such a use case, please share a sample PDF for inspection.
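For completeness, here is what removing the annotation variant could look like. This is only a sketch against the PDFBox 1.x API used in the question, and it only helps if the watermark actually is a watermark annotation; the subtype name "Watermark" is the one defined in ISO 32000-1.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;

public class WatermarkAnnotationRemover {
    public static void stripWatermarkAnnotations(String src, String dest)
            throws IOException, COSVisitorException {
        PDDocument document = PDDocument.load(src);
        try {
            for (Object p : document.getDocumentCatalog().getAllPages()) {
                PDPage page = (PDPage) p;
                List<PDAnnotation> kept = new ArrayList<PDAnnotation>();
                for (PDAnnotation annotation : page.getAnnotations()) {
                    // keep everything that is not a watermark annotation
                    if (!"Watermark".equals(annotation.getSubtype())) {
                        kept.add(annotation);
                    }
                }
                page.setAnnotations(kept);
            }
            document.save(dest);
        } finally {
            document.close();
        }
    }
}
For watermarks drawn into the page content itself (the far more common case), no such generic shortcut exists, as explained above.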

Related

iText 7 (PDFSweep) throws OutOfMemoryError when redacting a large PDF

I was trying to remove some contents from a PDF using PDFSweep. Below is part of my code; I am using the CompositeCleanupStrategy and adding a RegexBasedCleanupStrategy to the strategy:
CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
for (int i = 0; i < keywordlist.size(); i++) {
    String kvalue = keywordlist.get(i);
    Loger.getLogger().info("keyword " + i + "=" + kvalue);
    strategy.add(new RegexBasedCleanupStrategy(kvalue).setRedactionColor(ColorConstants.GRAY));
}

try {
    PdfWriter writer = new PdfWriter(dest);
    writer.setCompressionLevel(0);
    PdfDocument pdf = new PdfDocument(new PdfReader(src), writer);
    // sweep
    PdfAutoSweep pdfAutoSweep = new PdfAutoSweep(strategy);
    pdfAutoSweep.cleanUp(pdf);
    // close the document
    pdf.close();
} catch (IOException e) {
    e.printStackTrace();
}
When the strategy list is small, say only one or two entries, the cleanup works fine. However, with 243 entries in the keywordlist and a PDF of about 70 MB, I got the following error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.toLowerCase(String.java:2590)
at java.lang.String.toLowerCase(String.java:2670)
at com.itextpdf.io.font.PdfEncodings.convertToString(PdfEncodings.java:287)
at com.itextpdf.kernel.pdf.PdfString.toUnicodeString(PdfString.java:163)
at com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo.getUnscaledBaselineWithOffset(TextRenderInfo.java:425)
at com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo.getBaseline(TextRenderInfo.java:213)
at com.itextpdf.kernel.pdf.canvas.parser.listener.CharacterRenderInfo.<init>(CharacterRenderInfo.java:112)
at com.itextpdf.kernel.pdf.canvas.parser.listener.RegexBasedLocationExtractionStrategy.toCRI(RegexBasedLocationExtractionStrategy.java:156)
at com.itextpdf.kernel.pdf.canvas.parser.listener.RegexBasedLocationExtractionStrategy.eventOccurred(RegexBasedLocationExtractionStrategy.java:135)
at com.itextpdf.pdfcleanup.autosweep.CompositeCleanupStrategy.eventOccurred(CompositeCleanupStrategy.java:115)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.eventOccurred(PdfCanvasProcessor.java:534)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.displayPdfString(PdfCanvasProcessor.java:549)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.access$4700(PdfCanvasProcessor.java:108)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor$ShowTextArrayOperator.invoke(PdfCanvasProcessor.java:617)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.invokeOperator(PdfCanvasProcessor.java:452)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processContent(PdfCanvasProcessor.java:281)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processPageContent(PdfCanvasProcessor.java:302)
at com.itextpdf.kernel.pdf.canvas.parser.PdfDocumentContentParser.processContent(PdfDocumentContentParser.java:77)
at com.itextpdf.kernel.pdf.canvas.parser.PdfDocumentContentParser.processContent(PdfDocumentContentParser.java:90)
at com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep.getPdfCleanUpLocations(PdfAutoSweep.java:130)
at com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep.cleanUp(PdfAutoSweep.java:186)
(Full disclosure: original author of RegexBasedCleanupStrategy here)
RegexBasedCleanupStrategy is not meant to be used like this.
You are creating 200 instances of this class, all of which will go over the document to see whether they can match (chunk by chunk) the PDF against the regular expression.
In order to do this, they will store all chunks in the document, sort them, and then loop over them.
So you are duplicating the document 200-something times in memory.
That is your bottleneck.
My suggestion: build a better regular expression.
You can obviously match keywords a, b, c, etc. with the regex
(a)|(b)|(c)
This would copy the document in memory only once and then attempt to match the aggregate regex against it.
This has both performance and memory-footprint benefits.
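A minimal sketch of that suggestion, assuming the entries in keywordlist are literal keywords (hence the Pattern.quote; drop it if they are already regular expressions):
import java.util.List;
import java.util.regex.Pattern;

import com.itextpdf.kernel.colors.ColorConstants;
import com.itextpdf.pdfcleanup.autosweep.CompositeCleanupStrategy;
import com.itextpdf.pdfcleanup.autosweep.RegexBasedCleanupStrategy;

public class AggregateStrategyBuilder {
    // Build one aggregate pattern of the form (a)|(b)|(c) instead of
    // registering one RegexBasedCleanupStrategy per keyword.
    public static CompositeCleanupStrategy build(List<String> keywordlist) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < keywordlist.size(); i++) {
            if (i > 0) {
                regex.append('|');
            }
            regex.append('(').append(Pattern.quote(keywordlist.get(i))).append(')');
        }
        CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
        strategy.add(new RegexBasedCleanupStrategy(regex.toString())
                .setRedactionColor(ColorConstants.GRAY));
        return strategy;
    }
}
This way the document is parsed once, regardless of how many keywords you redact.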

itext: how to tweak text extraction?

I'm using iText 5.5.8 for Java.
Following the default, straightforward text extraction procedures, i.e.
PdfTextExtractor.getTextFromPage(reader, pageNumber)
I was surprised to find several mistakes in the output, specifically all letter ds come out as os.
So how does text extraction in iText really work? Is it some kind of OCR?
I took a look under the hood, trying to grasp how TextExtractionStrategy works, but I couldn't figure out much. SimpleTextExtractionStrategy, for example, seems to just determine the presence of lines and spaces, whereas it's TextRenderInfo that provides text by invoking some decode method on a GraphicsState's font field; that's as far as I could go without getting a major migraine.
So who's my man? Which class should I override or which parameter should I tweak to be able to tell iText "hey, you're reading all ds wrong!"
edit:
sample PDF can be found at http://www.fpozzi.com/stampastopper/download/ name of file is 0116_LR.pdf
Sorry, can't share a direct link.
This is some basic code for text extraction
import java.io.File;
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        try
        {
            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
                System.out.println("----------------------------------");
            }
        }
        catch (IOException e)
        {
            throw e;
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
edit after #blagae and #mkl answers
Before starting to fiddle with iText, I tried text extraction with Apache PDFBox (a project similar to iText that I just discovered), but it has the same issue.
Understanding how these programs treat text is way beyond my dedication, so I have written a simple method to extract text from the raw page content, that is, whatever stands between the BT and ET markers.
import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    private final static Pattern actualWordPattern = Pattern.compile("\\((.*?)\\)");

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        Matcher matcher;
        String line, extractedText;
        boolean anyMatchFound;
        try
        {
            for (int i = 1; i <= 16; i++)
            {
                byte[] contentBytes = ContentByteUtils.getContentBytesForPage(reader, i);
                RandomAccessFileOrArray raf = new RandomAccessFileOrArray(
                        new RandomAccessSourceFactory().createSource(contentBytes));
                // skip ahead to the first BT (begin text) marker
                while ((line = raf.readLine()) != null && !line.equals("BT"));
                extractedText = "";
                // collect everything between parentheses up to the ET (end text) marker
                while ((line = raf.readLine()) != null && !line.equals("ET"))
                {
                    anyMatchFound = false;
                    matcher = actualWordPattern.matcher(line);
                    while (matcher.find())
                    {
                        anyMatchFound = true;
                        extractedText += matcher.group(1);
                    }
                    if (anyMatchFound)
                        extractedText += "\n";
                }
                System.out.println(extractedText);
                System.out.println("+++++++++++++++++++++++++++");
                String properlyExtractedText = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(properlyExtractedText);
                System.out.println("---------------------------");
            }
        }
        catch (IOException e)
        {
            throw e;
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
It appears, at least in my case, that the characters are correct. However, the order of words or even letters is messy, super messy in fact, so this approach is unusable as well.
What really surprises me is that all methods I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screw something up.
I have come to the conclusion that the most reliable way to get some decent text extraction may also be the most unexpected: some good OCR.
I am now trying to:
1) transform the PDF into an image (PDFBox is great at doing that - do not even bother to try pdf-renderer)
2) OCR that image
I will post my results in a few days.
Your input document has been created in a strange (but 'legal') way. There is a Unicode mapping in the resources that maps arbitrary glyphs to Unicode points. In particular, character number 0x64, d in ASCII, is mapped to the glyph with Unicode point 0x6f (UTF-8), which is o, in this font. This is not a problem per se - any PDF viewer can handle it - but it is strange, because all other glyphs that are used are not "cross-mapped", e.g. character 0x63 is mapped to Unicode point 0x63 (which is c), etc.
Now for the reason that Acrobat does the text extraction correctly (except for the space), and the others go wrong. We'll have to delve into the PDF syntax for this:
[p, -17.9, e, -15.1, l, 1.4, l, 8.4, i, -20, m, 5.8, i, 14, st, -17.5, e, 31.2, ,, -20.1, a] TJ
<</ActualText <fffffffeffffffff00640064> >> BDC
5.102 0 Td
[d, -14.2, d] TJ
EMC
That tells a PDF viewer to print p-e-l-l-i- -m-i-st-e- -a on the first line of code, and d-d after that on the fourth line. However, d maps to o, which is apparently only a problem for text extraction. Acrobat does do the text extraction correctly, because there is a content marker /ActualText which says that whatever we write between the BDC and EMC markers must be parsed as dd (0x64,0x64).
So to answer your question: iText does this on the same level as a lot of well-respected viewers, which all ignore the /ActualText marker. Except for Acrobat, which does respect it and overrules the ToUnicode mapping.
And to really answer your question: iText is currently looking into parsing the /ActualText marker, but it will probably take a while before it gets into an official release.
This probably has to do with how the PDF was OCR'd in the first place, rather than with how iTextSharp is parsing the PDF's contents. Try copy/pasting the text from the PDF into Notepad, and see if the "ds -> os" transformation still occurs. If it does, you're going to have to do the following when parsing text from this particular PDF:
1) Identify all occurrences of the string "os".
2) Decide whether the word of which the given "os" instance is a constituent is a valid English/German/Spanish/... word.
3) If it IS a valid word, do nothing.
4) If it is NOT a valid word, perform the reverse "os -> ds" transformation and check against the dictionary in the language of your choice again.
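A tiny sketch of that heuristic; the dictionary set and the word source are placeholders you would supply yourself:
import java.util.Set;

public class OsDsFixer {
    // If a word containing "os" is not in the dictionary, try the reverse
    // "os" -> "ds" transformation and keep it only if that yields a valid word.
    public static String fixWord(String word, Set<String> dictionary) {
        if (!word.contains("os") || dictionary.contains(word)) {
            return word; // nothing to fix, or already a valid word
        }
        String candidate = word.replace("os", "ds");
        return dictionary.contains(candidate) ? candidate : word;
    }
}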

How to store and compare annotations (with a Gold Standard) in GATE

I am very comfortable with UIMA, but my new work requires me to use GATE.
So, I started learning GATE. My question is regarding how to calculate the performance of my tagging engines (Java based).
With UIMA, I generally dump all my system annotations into an XMI file and then, using Java code, compare them with human-annotated (gold standard) annotations to calculate precision/recall and F-score.
But I am still struggling to find something similar in GATE.
After going through GATE's Annotation-Diff and other info on that page, I feel there has to be an easy way to do it in Java, but I am not able to figure out how. I thought I would put this question here, as someone might have already figured it out.
How to store system annotations in XMI or any other file format programmatically.
How to create one-time gold standard data (i.e. human-annotated data) for performance calculation.
Let me know if you need more specifics or details.
This code seems helpful for writing the annotations to an XML file:
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/BatchProcessApp.java
String docXMLString = null;
// if we want to just write out specific annotation types, we must
// extract the annotations into a Set
if (annotTypesToWrite != null) {
    // Create a temporary Set to hold the annotations we wish to write out
    Set annotationsToWrite = new HashSet();
    // we only extract annotations from the default (unnamed) AnnotationSet
    // in this example
    AnnotationSet defaultAnnots = doc.getAnnotations();
    Iterator annotTypesIt = annotTypesToWrite.iterator();
    while (annotTypesIt.hasNext()) {
        // extract all the annotations of each requested type and add them to
        // the temporary set
        AnnotationSet annotsOfThisType =
                defaultAnnots.get((String) annotTypesIt.next());
        if (annotsOfThisType != null) {
            annotationsToWrite.addAll(annotsOfThisType);
        }
    }
    // create the XML string using these annotations
    docXMLString = doc.toXml(annotationsToWrite);
}
// otherwise, just write out the whole document as GateXML
else {
    docXMLString = doc.toXml();
}

// Release the document, as it is no longer needed
Factory.deleteResource(doc);

// output the XML to <inputFile>.out.xml
String outputFileName = docFile.getName() + ".out.xml";
File outputFile = new File(docFile.getParentFile(), outputFileName);

// Write output files using the same encoding as the original
FileOutputStream fos = new FileOutputStream(outputFile);
BufferedOutputStream bos = new BufferedOutputStream(fos);
OutputStreamWriter out;
if (encoding == null) {
    out = new OutputStreamWriter(bos);
} else {
    out = new OutputStreamWriter(bos, encoding);
}
out.write(docXMLString);
out.close();
System.out.println("done");

PdfUtilities.convertPdf2Png creates images automatically in my directory

I've written some code to perform OCR on a PDF using Tesseract (Tess4J):
public void DoOCRAnalyse(String From) throws FileNotFoundException {
    Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
    File[] files = PdfUtilities.convertPdf2Png(new File(From));
    for (File f : files) {
        try {
            String result = instance.doOCR(f);
            /* String result = instance.doOCR(take File or BufferedImage); */
            SearchForSVHC(result, SvhcList);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
It recognizes text, which is great, but my problem is that it needs the images to be in a directory on disk. How can I pass a BufferedImage or File to the method doOCR() without needing the files on disk?
You are passing a File object to doOCR. When you call convertPdf2Png, it invokes GhostScript to convert a PDF file to one or more PNG files. You can certainly delete them after OCR if you want, e.g., by executing f.delete() in a finally block.
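Alternatively, if you want to avoid temporary files altogether, you can render the pages to BufferedImages yourself and hand those straight to doOCR, which also accepts a BufferedImage. A sketch assuming PDFBox 2.x for the rendering (SearchForSVHC and SvhcList are the helpers from your own code):
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public void DoOCRAnalyseInMemory(String from) throws IOException {
    Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
    PDDocument document = PDDocument.load(new File(from));
    try {
        PDFRenderer renderer = new PDFRenderer(document);
        for (int page = 0; page < document.getNumberOfPages(); page++) {
            // render at 300 DPI; no PNG file ever touches the disk
            BufferedImage image = renderer.renderImageWithDPI(page, 300);
            try {
                String result = instance.doOCR(image);
                SearchForSVHC(result, SvhcList);
            } catch (TesseractException e) {
                System.err.println(e.getMessage());
            }
        }
    } finally {
        document.close();
    }
}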

EncodedImage.getEncodedImageResource fails to load an image with the same name in a different subfolder in Eclipse (BlackBerry plugin)

I'm using the BlackBerry JDE Plugin v1.3 for Eclipse, and I'm trying this code to create a BitmapField; I've always done it this way:
this.bitmap = EncodedImage.getEncodedImageResource("ico_01.png");
this.bitmap = this.bitmap.scaleImage32(
this.conf.getWidthScale(), this.conf.getHeightScale());
this.imagenLoad = new BitmapField(this.bitmap.getBitmap(), this.style);
It works fine with no errors, but now I have this set of images with the same name but in different subfolders like this:
I made it smaller than it actually is for explanatory reasons. I wouldn't want to rename the files so they're all different; I would like to know how to access the different subfolders. "res/img/on/ico_01.jpg", "img/on/ico_01.jpg", and "on/ico_01.jpg" are some examples that I tried without success.
It appears that EncodedImage.getEncodedImageResource(filename) will retrieve the first instance of filename regardless of where it is in your resource directory tree.
This is not very helpful if you have the images with the same filename in different directories (as you have).
The solution I have used is to create my own method which can return an image based on a path and filename.
public static Bitmap getBitmapFromResource(String resourceFilename) {
    Bitmap imageBitmap = null;
    // get the image as a byte stream
    InputStream imageStream = getInstance().getClass().getResourceAsStream(resourceFilename);
    // load it into memory
    byte imageBytes[];
    try {
        imageBytes = IOUtilities.streamToBytes(imageStream);
        // create the bitmap
        imageBitmap = Bitmap.createBitmapFromBytes(imageBytes, 0, imageBytes.length, 1);
    } catch (IOException e) {
        Logger.log("Error loading: " + resourceFilename + ". " + e.getMessage());
    }
    return imageBitmap;
}
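Usage is then as simple as passing the resource path; the leading slash and the full path within the resource tree are what disambiguate files sharing a name (the paths below are hypothetical, adjust them to your project):
// hypothetical resource paths; adjust to your own resource tree
Bitmap on = getBitmapFromResource("/img/on/ico_01.png");
Bitmap off = getBitmapFromResource("/img/off/ico_01.png");
BitmapField field = new BitmapField(on, BitmapField.FOCUSABLE);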