Reasonable Tesseract OCR settings using Apache Tika?

I'm using Apache Tika for text extraction and need to handle scanned PDFs, so I'm trying Tesseract. However, I can't find any good resource on sensible default settings.
I'm also experiencing what seems like weird post-processing artifacts:
I get this:
"och ptensionskos nader"
from this image:
It really looks as if some post-processing has moved the t to the beginning of the word and left a blank in its place. I can't see why it would do this unless some post-processing settings are badly off.
These are my basic settings from Apache Tika:
import org.apache.tika.parser.ocr.TesseractOCRConfig
import org.apache.tika.parser.pdf.PDFParserConfig

val pdfConfig: PDFParserConfig = {
  val pdfConf = new PDFParserConfig()
  pdfConf.setOcrDPI(150)
  pdfConf.setDetectAngles(false)
  pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)
  pdfConf
}
val tesseractOCRConfig: TesseractOCRConfig = {
  val tessConf = new TesseractOCRConfig()
  tessConf.setLanguage("eng+swe")
  tessConf.setEnableImageProcessing(1)
  tessConf.setResize(100) // 100-900 - lower faster.
  // tessConf.setApplyRotation(true)
  tessConf
}
Any help highly appreciated!

Another important property in the PDF config controls whether inline images are processed or skipped:
pdfConf.setExtractInlineImages(true) // for a scanned PDF, setting this to false makes no sense
In TesseractOCRConfig, setTimeout() is also useful.

MailKit - How can bodybuilder help create a complex email body with many images interspersed with text?

For example, what if you need to create an email body like this:
Text ...
Image ...
Text ...
Image ...
Text
Here is one of the examples that works for one text and one image:
// requires: using System.IO; using MimeKit; using MimeKit.Utils;
var builder = new BodyBuilder ();
var pathImage = Path.Combine (Misc.GetPathOfExecutingAssembly (), "Image.png");
var image = builder.LinkedResources.Add (pathImage);
image.ContentId = MimeUtils.GenerateMessageId ();
builder.HtmlBody = string.Format (@"<p>Hey!</p><img src=""cid:{0}"">", image.ContentId);
message.Body = builder.ToMessageBody ();
Can we do something like builder.HtmlBody += to just keep adding more and more texts and images?
The BodyBuilder class is designed for constructing typical message structures, not the kind of thing you are doing.
You will need to construct your message manually, not using BodyBuilder.
After quite a bit of trial/error/testing, I discovered that you can indeed keep adding text and images to the HtmlBody object, as my question speculated, by using builder.HtmlBody +=
In response to the increasingly widespread use of TLS instead of SSL, and therefore the need to abandon the use of Microsoft's obsolete SmtpClient component, I have developed a comprehensive emailing test component, in Visual Basic, using the wonderful MailKit from JStedfast.
As my question suggested, I wanted to give my users the ability to compose a handsome email body using text interspersed with images as needed. If any VB developers would like to benefit from this work, just let me know.
@jstedfast - I just saw your answer after posting this. For my production version, I need to add images from a blob field in a SQL Server table. I intend to use the manual method, as you stated, to do that. But for the images I was loading into my sample program, I was able to make a fairly complex email body using src=file for each image and adding them with builder.HtmlBody +=

Stop huge error output from testing-library

I love testing-library, have used it a lot in a React project, and I'm trying to use it in an Angular project now - but I've always struggled with the enormous error output, including the HTML text of the render. Not only is this not usually helpful (I couldn't find an element, here's the HTML where it isn't); but it gets truncated, often before the interesting line if you're running in debug mode.
I simply added it as a library alongside the standard Angular Karma+Jasmine setup.
I'm sure you could say the components I'm testing are too large if the HTML output causes my console window to spool for ages, but I have a lot of integration tests in Protractor, and they are SO SLOW :(.
I would say the best solution would be to use the configure method and pass a custom function for getElementError which does what you want.
You can read about configuration here: https://testing-library.com/docs/dom-testing-library/api-configuration
An example of this might look like:
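import { configure } from '@testing-library/dom';

configure({
  // Called when a getBy* query fails. Building the error from the message
  // alone means the rendered DOM is not appended to it.
  getElementError: (message: string, container) => {
    const error = new Error(message);
    error.name = 'TestingLibraryElementError';
    error.stack = null; // drop the call stack as well
    return error;
  },
});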
You can then put this in any single test file or use Jest's setupFiles or setupFilesAfterEnv config options to have it run globally.
I am assuming you are running Jest with RTL in your project.
I personally wouldn't turn it off as it's there to help us, but everyone has a way so if you have your reasons, then fair enough.
1. If you want to disable errors for a specific test, you can mock the console.error.
it('disable error example', () => {
  const errorObject = console.error; // store the original function
  console.error = jest.fn(); // mock it
  // code
  // assertion (expect)
  console.error = errorObject; // assign it back so you can use it in the next test
});
2. If you want to silence it for all the tests, you could use the jest --silent CLI option. Check the docs
The above might even disable the DOM printing that is done by RTL. I am not sure, as I haven't tried it, but the docs I linked say:
"Prevent tests from printing messages through the console."
If the DOM printout still appears after that, you have almost certainly disabled everything else. In that case you might look into react-testing-library's source code and find out what is used for those print statements. Is it a console.log? Is it a console.warn? Once you know, just mock it out as in option 1 above.
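One caveat with option 1: if an assertion throws, the line that restores console.error never runs, and the mock leaks into later tests. Here is a minimal, framework-agnostic sketch of a helper that restores it in a finally block. The name withSilencedConsoleError is made up for this example and is not part of Jest or testing-library.

```typescript
// Hypothetical helper (not part of Jest or testing-library): runs `fn` with
// console.error replaced by a recorder, and always restores the original.
function withSilencedConsoleError<T>(fn: () => T): { result: T; calls: unknown[][] } {
  const original = console.error; // keep a reference to the real function
  const calls: unknown[][] = [];  // arguments of every suppressed call
  console.error = (...args: unknown[]): void => {
    calls.push(args);
  };
  try {
    return { result: fn(), calls };
  } finally {
    console.error = original; // runs even if fn() throws
  }
}
```

Inside a test you would wrap only the code that is expected to log, then assert on the returned calls if you care about them. The sketch is synchronous; an async test body would need an awaiting variant.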
UPDATE
After some digging, I found out that all testing-library DOM printing is built on prettyDOM();
While prettyDOM() can't be disabled you can limit the number of lines to 0, and that would just give you the error message and three dots ... below the message.
Here is an example printout, I messed around with:
TestingLibraryElementError: Unable to find an element with the text: Hello ther. This could be because the text is broken up by multiple elements. In this case, you can provide a function for your text matcher to make your matcher more flexible.
...
All you need to do is to pass in an environment variable before executing your test suite, so for example with an npm script it would look like:
DEBUG_PRINT_LIMIT=0 npm run test
Here is the doc
UPDATE 2:
As per the OP's feature request on GitHub, this can also be achieved without injecting a global variable to limit the prettyDOM line output (in case it's used elsewhere). The getElementError config option needs to be changed:
dom-testing-library/src/config.js
// called when getBy* queries fail. (message, container) => Error
getElementError(message, container) {
  const error = new Error(
    [message, prettyDOM(container)].filter(Boolean).join('\n\n'),
  )
  error.name = 'TestingLibraryElementError'
  return error
},
The callstack can also be removed
You can change how the message is built by setting the DOM testing library message building function with config. In my Angular project I added this to test.js:
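import { configure } from '@testing-library/dom';

configure({
  // Called when a getBy* query fails. Building the error from the message
  // alone means the rendered DOM is not appended to it.
  getElementError: (message: string, container) => {
    const error = new Error(message);
    error.name = 'TestingLibraryElementError';
    error.stack = null; // drop the call stack as well
    return error;
  },
});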
This was answered here: https://github.com/testing-library/dom-testing-library/issues/773 by https://github.com/wyze.
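If dropping the DOM entirely is too drastic, a middle ground is to keep only the first few lines of it. The sketch below returns a function with the same (message, container) => Error shape as getElementError, so it could presumably be passed to configure(). The factory name makeShortElementError, the 5-line default, and the use of container.outerHTML instead of prettyDOM are all assumptions made to keep the example self-contained.

```typescript
// Minimal structural type so the sketch runs without a DOM; a real Element fits it.
type HasOuterHTML = { outerHTML: string };

// Hypothetical factory: returns a function with the getElementError shape
// that appends at most `maxLines` lines of the container's markup.
function makeShortElementError(maxLines = 5) {
  return (message: string | null, container: HasOuterHTML): Error => {
    const lines = container.outerHTML.split("\n");
    const shown = lines.slice(0, maxLines).join("\n");
    const omitted = lines.length - Math.min(lines.length, maxLines);
    const tail = omitted > 0 ? `\n... (${omitted} more lines)` : "";
    // Skip the message if it is null or empty, as the library default does.
    const parts = [message, shown + tail].filter(Boolean);
    const error = new Error(parts.join("\n\n"));
    error.name = "TestingLibraryElementError";
    return error;
  };
}
```

With this in place, something like configure({ getElementError: makeShortElementError(10) }) in the test setup file would cap the markup at ten lines per failure without touching DEBUG_PRINT_LIMIT.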

How to customize addContentItemDialog to restrict files over 10mb upload in IBM Content Navigator

I am customizing ICN (IBM Content Navigator) 2.0.3. My requirement is to prevent users from uploading files over 10 MB and to allow only .pdf and .docx files.
I know I have to extend or customize the AddContentItemDialog, but there is very little detail on exactly how to do it, and no video either. I'd appreciate it if someone could guide me.
Thanks
I installed the development environment but I am not sure how to extend the AddContentItemDialog.
public void applicationInit(HttpServletRequest request,
        PluginServiceCallbacks callbacks) throws Exception {
}
I want to also know how to roll out the changes to ICN.
This can be easily extended. I suggest reading the ICN Redbook for details on how to do it, but it is pretty standard code.
Regarding rolling out the code to ICN, there are two ways:
- If you are using plugin: just replace the Jar file on the server location and restart WAS.
- If you are using EDS: you need to redeploy the web service and restart WAS.
Hope this helps.
thanks
Although there are many ways to do this, one way is indeed to extend or augment the AddContentItemDialog, as you mentioned. After looking at the (rather poor) IBM documentation, I figured you could probably hook into the onAdd event/method.
dojo/aspect#around allows you to do exactly that. Example:
require(["dojo/aspect", "ecm/widget/dialog/AddContentItemDialog"], function(aspect, AddContentItemDialog) {
  // Wrap onAdd so the files are validated before the original handler runs.
  aspect.around(AddContentItemDialog.prototype, "onAdd", function advisor(original) {
    return function around() {
      var files = this.addContentItemGeneralPane.getFileInputFiles();
      var containsInvalidFiles = dojo.some(files, function isInvalid(file) {
        var fileName = file.name.toLowerCase();
        var extensionOK = fileName.endsWith(".pdf") || fileName.endsWith(".docx");
        var fileSizeOK = file.size <= 10 * 1024 * 1024; // 10 MB
        return !(extensionOK && fileSizeOK);
      });
      if (containsInvalidFiles) {
        alert("You can't add that :)");
      } else {
        // Pass the original handler's return value through.
        return original.apply(this, arguments);
      }
    };
  });
});
Just make sure this code gets executed before the actual dialog is opened. The best way to achieve this is to wrap the code in a new plugin.
As for creating and deploying plugins, the easiest way is this wizard for Eclipse (see also a repackaged version for newer Eclipse versions). Just create a new arbitrary plugin and paste this JavaScript code into the generated .js file.
Additionally it might be good to note that you're only limiting "this specific dialog" to upload specific files. It would probably be a good idea to also create a requestFilter to limit all possible uses of the addContent api...
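The validation rule inside the advisor can also be pulled out into a small pure function, which makes it easy to unit test and to mirror in a server-side check. Here is a sketch in TypeScript. The names UploadCandidate, MAX_UPLOAD_BYTES, ALLOWED_EXTENSIONS, isAllowedUpload and findRejectedFiles are invented for this example and are not part of the ICN API.

```typescript
// Only the two fields the rule needs; a browser File object fits this shape.
interface UploadCandidate {
  name: string;
  size: number; // bytes
}

const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // 10 MB, inclusive
const ALLOWED_EXTENSIONS = [".pdf", ".docx"];

// True when the file has an allowed extension (case-insensitive)
// and is no larger than the size limit.
function isAllowedUpload(file: UploadCandidate): boolean {
  const lower = file.name.toLowerCase();
  const extensionOK = ALLOWED_EXTENSIONS.some(
    (ext) => lower.slice(-ext.length) === ext
  );
  return extensionOK && file.size <= MAX_UPLOAD_BYTES;
}

// Returns the files that break the rule, so the UI can name them.
function findRejectedFiles(files: UploadCandidate[]): UploadCandidate[] {
  return files.filter((f) => !isAllowedUpload(f));
}
```

Returning the rejected files instead of a single boolean lets the dialog tell the user which file was refused, rather than showing a generic alert.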

Scala way of choosing configuration for different environments

My application needs to read configuration file either from the resource directory or from s3.
For local development, I need to read it from the local resource directory. When I build my project, however, I don't put the configuration file config.properties into my application jar file, and in that case it should read the configuration from S3. What I can think of doing in Scala is pretty much what I would do in Java:
val stream: InputStream = getClass.getResourceAsStream("/config.properties")
if (stream != null) {
  val lines = scala.io.Source.fromInputStream(stream).getLines
} else {
  /* read it from S3 */
}
But I think Scala offers a more functional programming syntax. Any advice?
There are probably better ways to go about what you're after, but here's a more-or-less straight translation of the posted code.
val lines: Iterator[String] =
  Option(getClass.getResourceAsStream("/config.properties"))
    .fold[Iterator[String]] {
      ??? // placeholder: read from S3 and return an Iterator[String]
    }(scala.io.Source.fromInputStream(_).getLines())

Downloadable xml files in Play Framework

I am a Scala/PlayFramework noob here, so please be easy on me :).
I am trying to create an action (serving a GET request) so that when I enter the url in the browser, the browser should download the file. So far I have this:
def sepaCreditXml() = Action {
  val data: SepaCreditTransfer = invoiceService.sepaCredit()
  val content: HtmlFormat.Appendable = views.html.sepacredittransfer(data)
  Ok(content)
}
What it does is basically show the XML in the browser (whereas I actually want it to download the file). Also, I have two problems with it:
I am not sure if using Play's templating "views.html..." is the best idea to create an XML template. Is it good/simple enough or should I use a different solution for this?
I have found Ok.sendFile in the Play's documentation. But it needs a java.io.File. I don't know how to create a File from HtmlFormat.Appendable. I would prefer to create a file in-memory, i.e. no new File("/tmp/temporary.xml").
EDIT: Here SepaCreditTransfer is a case class holding some data. Nothing special.
I think it's quite normal for browsers to visualize XML instead of downloading it. Have you tried to use the application/force-download content type header, like this?
def sepaCreditXml() = Action {
  val data: SepaCreditTransfer = invoiceService.sepaCredit()
  val content: HtmlFormat.Appendable = views.html.sepacredittransfer(data)
  Ok(content).withHeaders("Content-Type" -> "application/force-download")
}