Tika detects Tesseract but doesn't perform any OCR - tesseract

I just installed Tika from the Github's repository and tried to OCR a PDF which contains scanned document pages.
java -cp tika-app/target/tika-app-1.17-SNAPSHOT.jar org.apache.tika.cli.TikaCLI /tmp/testing/sample_scanned.pdf
However, only metadata gets extracted (although I got confirmation beforehand that Tesseract is installed and utilized:
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
(Full output)
Note: Regular PDFs (containing) plain text gets extract successfully. The problem seems to be the OCR process itself.
This has been tested on Centos as well as Ubuntu - same issue.
Do I need to make changes to config files, specify more parsers? What could cause this?
Thank you.

Turns out PDF image extraction is disabled by default. From PDFParserConfig:
Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors. Set to true with caution. The default is false.
A simple example to enable it that worked for me:
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
ParseContext parseContext = new ParseContext();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
parseContext.set(PDFParserConfig.class, pdfConfig);
try (InputStream stream = ClasspathUtil.readStreamFromClasspath("test.pdf")) {
parser.parse(stream, handler, new Metadata(), parseContext);
System.out.println(handler.toString());
}

Related

Why does DITA Open Toolkit PDF plugin rename image href attributes?

I'm sorry if this doesn't have enough information. I don't typically ask for help online like this.
I'm using DITA Open Toolkit 3.4 on Windows. I generated a plugin called "vcr2" using Jarno's (very excellent and helpful) PDF Plugin Generator and then made a handful of customizations. The plugin uses the pdf2 plugin as a base. When I try to use the vcr2 plugin, my images are not working. I've tracked the problem down to malformed image filenames in the image's href attribute.
For example:
In my source file (a DITA Task), the markup for one of my images looks like this:
<image href="MyRemindersChooseReminder.png"/>
If I run a transform with the pdf2 plugin, the images work fine. In the merged stage1.xml file in the Temp folder, the XML for that same image looks like this:
<image class="- topic/image " href="df2d132af27436c59c5c8c4282e112d62bec8201.png" placement="inline" xtrc="image:1;10:66" xtrf="file:/V:/Vasont/Extract/t12340879-minimal/t12340879.xml"/>
It is processed into a file Topic.fo, and looks like this:
<fo:external-graphic
 src="url('file:/V:/Vasont/Extract/t12340879-minimal/MyRemindersChooseReminder.png')"/>
Everything works fine and the image looks fine.
If I run the same file through my 'vcr2' plugin, which just calls the same pdf2 plugin with some overrides, all the images get broken:
stage1.xml
<image class="- topic/image " href="df2d132af27436c59c5c8c4282e112d62bec8201.png" placement="inline" xtrc="image:1;10:66" xtrf="file:/V:/Vasont/Extract/t12340879-minimal/t12340879.xml"/>
Topic.fo
<fo:external-graphic
 src="url('file:/V:/Vasont/Extract/t12340879-minimal/df2d132af27436c59c5c8c4282e112d62bec8201.png')"
/>
As I track this down further, it appears that somewhere in the map-reader Ant task, this filename gets changed to that cryptic string of pseudo-hexadecimal. I think later on it's supposed to be changed back or resolved to a complete URI or something.
So, the two-part question is: Why does Open Toolkit change my filenames, and what's supposed to change them back?
DITA-OT's preprocess uses hashes for temporary filenames because it allows the code to not deal with directory structures. This enables preprocess to work in so-called "map-first" mode, where it first processes all DITA map resources and only then starts to process DITA topic and image resources.
The preprocess has a step called clean-preprocess that can rewrite the temporary file names to match source resource files names. However, this rewrite operation is disabled for PDF output because the original file names are not used for anything in that output type.

Can OpenXML be used to launch a new Word instance?

I'm able to generate Word documents without issue. I save the resulting *.docx file to a temporary location and then need to launch the file in Word.
The requirement is to not "open" the file in Word (easily done with a Process.Start) but to have load into Word as a new unsaved file. This is because certain propriety integrations for Word need to take over when a user saves the file and don't kick in if the file is ready saved but to a location on disk.
I've achieved this by using Interop calls to the Word application, adding the new document to Word's workspace. My problem is with Interop which tends to break on various client machines, particularly when Office upgrades take place (say a client had 32-bit office but upgraded with a 64-bit version).
I'm somewhat new to OpenXML, but can it be used to automate Word or is Interop my only real option?
object oFilename = tmpFileName;
object oNewTemplate = false;
object oDocumentType = 0;
object oVisible = true;
Document document = _application.Documents.Add(ref oFilename, ref oNewTemplate, ref oDocumentType, ref oVisible);
No, the Open XML technology has no way of interacting with the Office (Word) application - it's for file creation/manipulation, only. The interop is required in order to do anything with the Word application.
There is sort of a way around this - and it's only possible with Word, no other Office application has this - is to convert the Open XML content to the OPC flat-file format. This "concatenates" the various packages that make up the zip file to a pure text string, essetially a single XML file.
XML content in the OPC flat-file format can then be written to an already opened (even newly created) Word document using the Range.InsertXML method via "the interop". In a way, this "streams" the Open XML content into the opened Word document.
The problem with this approach is that certain document-level properties are not written to the target document, so not all aspects of the opened document can be changed. For example: page size, orientation, headers, footers... So if this kind of thing also needs to be affected the interop is required for such settings.

WebSphere MQ binary fiiles

This might be a question that may not be answered due to the nature of the external tool I am using (lack of documentation).
Basically, I am using a tool that pushes and pulls messages from the queue, more precisely - it pushes and pulls files. It worked perfectly for text files but when I tried pushing and then pulling a binary file - the pulled one was corrupted, it's size increased in comparsion with the original file (1.33 ratio).
For example moving a zip file wouldn't work...
I suppose it has something to do with the tools configuration, the only settings that can be changed regarding the problem are CCSID and encoding (UTF-8, Base16, etc.), I tried playing with both, unfortunately without success.
Tried using the following CCSIDs: 65535, 1208, 819
and encodings : UTF-8, Base16, Base64
In every case the binary file was corrupted after pulling it from the queue, I'm not entirely sure how the tool acomplishes that, it's written in Java, also I'm new to MQ so I tried searching for the correct options in IBM's docs but I haven't found anything that makes more sense than 65535 and Base16, yet it still doesn't work, could anyone with more experience with MQ tell if playing with these options makes sense at all in this case and if so - suggest what CCSID and encoding can I try to accomplish what Ive described above?
More information is really needed, but my suspicion is you are putting the message on the queue as a text message and playing around with encodings and ccsid's to try to get it right. You really need to know how the 'Java' app achieves this - is it using JMS (eg JMSBytesMessage) or base Java (something like setMessageData).
At a high level, there is a header on a message (The MD) which 'describes' the data - the MD format field. If you say the data is a string then MQ can convert between codepages should the getter request it etc. Put a tiny binary file into a message onto a queue, and browse the queue with amqsbcg or the GUI - what are the MD fields for format? What headers are on the payload - anything like RFH2's?
Put the same code in to give us a clue, or at least the amqsbcg output

Does GemBox support column filters?

If I load an xlsx file that already has a filter, and then save the file using GemBox, it seems to throw my filtering cell away. Does GemBox support filtering at all? I know I can load the file in preserved mode but my intention is to create the filter in my C# app.
EDIT 2015-06-15:
The newer version of GemBox.Spreadsheet (version 3.9) does have an API support for filters, see the version history page. Also, you can find here an Excel AutoFilter example.
ORIGINAL ANSWER:
Unfortunately, in the current version (3.5) GemBox.Spreadsheet supports filtering only through preservation.

Photoshop plugin Pipl resources and disabling save

I have a photoshop plugin for a file format Ive written in c++ that loads and opens the images, however I do not have code to save the image in the same format
Using SimpleFormat sample plugin as a base I have the following code:
FormatFlags { fmtSavesImageResources,
fmtCanRead,
fmtCanWrite,
fmtCanWriteIfRead,
fmtCanWriteTransparency,
fmtCanCreateThumbnail },
However removing fmtCanWrite or IfRead etc produces parser errors in the Pipl tool, I've checked the syntax and it should be correct but I cannot figure out how to do this =s
This is really counter-intuitive but if you check out pg 77 of Plug-in Resource Guide.pdf from the SDK the flags aren't really flags, they are actually keywords. Based on the grammar they give, to not include the write flag you actually need to replace it with a do-not-write flag.
For example, this compiles fine for me:
FormatFlags { fmtDoesNotSavesImageResources,
fmtCanRead,
fmtCannotWrite,
fmtCanWriteIfRead,
fmtCanWriteTransparency,
fmtCanCreateThumbnail }