simple example for tesseractengine3 .net wrapper - tesseract

I'm trying to do some simple OCR tasks and I'm still searching for a free library. Since everybody seems to use tesseract, can someone provide me with a simple but working example of using tesseractengine3.dll with C# or VB.NET, please? After searching for several hours I have not been able to find any documentation or an example which compiles under VS2010 and .NET 4.

Try this
Ocr ocr = new Ocr();
private void button1_Click(object sender, EventArgs e)
{
using (Bitmap bmp = new Bitmap(@"C:\OCR\ocr-test.jpg"))
{
tessnet2.Tesseract tessocr = new tessnet2.Tesseract();
tessocr.Init(null, "eng", false);
tessocr.GetThresholdedImage(bmp, Rectangle.Empty).Save("c:\\temp\\" + Guid.NewGuid().ToString() + ".bmp");
// The tessdata directory must be in the same directory as this exe
Console.WriteLine("Multithread version");
ocr.DoOCRMultiThred(bmp, "eng");
Console.WriteLine("Normal version");
ocr.DoOCRNormal(bmp, "eng");
}
}
public class Ocr
{
public void DumpResult(List<tessnet2.Word> result)
{
foreach (tessnet2.Word word in result)
Console.WriteLine("{0} : {1}", word.Confidence, word.Text);
}
public List<tessnet2.Word> DoOCRNormal(Bitmap image, string lang)
{
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.Init(null, lang, false);
List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
DumpResult(result);
return result;
}
ManualResetEvent m_event;
public void DoOCRMultiThred(Bitmap image, string lang)
{
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.Init(null, lang, false);
// If the OcrDone delegate is not null then this'll be the multithreaded version
ocr.OcrDone = new tessnet2.Tesseract.OcrDoneHandler(Finished);
// For event to work, must use the multithreaded version
ocr.ProgressEvent += new tessnet2.Tesseract.ProgressHandler(ocr_ProgressEvent);
m_event = new ManualResetEvent(false);
ocr.DoOCR(image, Rectangle.Empty);
// Wait here until it's finished
m_event.WaitOne();
}
public void Finished(List<tessnet2.Word> result)
{
DumpResult(result);
m_event.Set();
}
void ocr_ProgressEvent(int percent)
{
Console.WriteLine("{0}% progression", percent);
}
}

Try the https://github.com/charlesw/tesseract library, which is used in http://vietocr.sourceforge.net/, an awesome open source OCR tool. For a simple example, look at the BaseApiTester project in the library's source code.

There is a .NET wrapper for Tesseract 3.01: tesseract-ocr-dotnet


ITextSharp / PDFBox text extract fails for certain pdfs

The code below extracts the text from a PDF correctly via ITextSharp in many instances.
using (var pdfReader = new PdfReader(filename))
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
var currentText = PdfTextExtractor.GetTextFromPage(
pdfReader,
1,
strategy);
currentText =
Encoding.UTF8.GetString(Encoding.Convert(
Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
Console.WriteLine(currentText);
}
However, in the case of this PDF I get the following instead of text: "\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\a\u0001\u0002\u0003\u0004\u0005\u0006\u0003"
I have tried different encodings and even PDFBox but still failed to decode the PDF correctly. Any ideas on how to solve the issue?
Extracting the text nonetheless
@Bruno's answer is the answer one should give here, the PDF clearly does not provide the information required to allow proper text extraction according to section 9.10 Extraction of Text Content of the PDF specification ISO 32000-1...
But there actually is a slightly evil way to extract the text from the PDF at hand nonetheless!
Wrapping one's text extraction strategy in an instance of the following class, the garbled text is replaced by the correct text:
public class RemappingExtractionFilter : ITextExtractionStrategy
{
ITextExtractionStrategy strategy;
System.Reflection.FieldInfo stringField;
public RemappingExtractionFilter(ITextExtractionStrategy strategy)
{
this.strategy = strategy;
this.stringField = typeof(TextRenderInfo).GetField("text", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
}
public void RenderText(TextRenderInfo renderInfo)
{
DocumentFont font = renderInfo.GetFont();
PdfDictionary dict = font.FontDictionary;
PdfDictionary encoding = dict.GetAsDict(PdfName.ENCODING);
PdfArray diffs = encoding.GetAsArray(PdfName.DIFFERENCES);
StringBuilder builder = new StringBuilder();
foreach (byte b in renderInfo.PdfString.GetBytes())
{
PdfName name = diffs.GetAsName((char)b);
String s = name.ToString().Substring(2);
int i = Convert.ToInt32(s, 16);
builder.Append((char)i);
}
stringField.SetValue(renderInfo, builder.ToString());
strategy.RenderText(renderInfo);
}
public void BeginTextBlock()
{
strategy.BeginTextBlock();
}
public void EndTextBlock()
{
strategy.EndTextBlock();
}
public void RenderImage(ImageRenderInfo renderInfo)
{
strategy.RenderImage(renderInfo);
}
public String GetResultantText()
{
return strategy.GetResultantText();
}
}
It can be used like this:
ITextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
string text = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
Beware, I had to use System.Reflection to access private members. Some environments may forbid this.
The same in Java
I initially coded this in Java for iText because that's my primary development environment. Thus, here is the initial Java version:
public class RemappingExtractionFilter implements TextExtractionStrategy
{
public RemappingExtractionFilter(TextExtractionStrategy strategy) throws NoSuchFieldException, SecurityException
{
this.strategy = strategy;
this.stringField = TextRenderInfo.class.getDeclaredField("text");
this.stringField.setAccessible(true);
}
@Override
public void renderText(TextRenderInfo renderInfo)
{
DocumentFont font = renderInfo.getFont();
PdfDictionary dict = font.getFontDictionary();
PdfDictionary encoding = dict.getAsDict(PdfName.ENCODING);
PdfArray diffs = encoding.getAsArray(PdfName.DIFFERENCES);
StringBuilder builder = new StringBuilder();
for (byte b : renderInfo.getPdfString().getBytes())
{
// Java bytes are signed, so mask to get the unsigned character code as the index
PdfName name = diffs.getAsName(b & 0xFF);
String s = name.toString().substring(2);
int i = Integer.parseUnsignedInt(s, 16);
builder.append((char)i);
}
try
{
stringField.set(renderInfo, builder.toString());
}
catch (IllegalArgumentException | IllegalAccessException e)
{
e.printStackTrace();
}
strategy.renderText(renderInfo);
}
@Override
public void beginTextBlock()
{
strategy.beginTextBlock();
}
@Override
public void endTextBlock()
{
strategy.endTextBlock();
}
@Override
public void renderImage(ImageRenderInfo renderInfo)
{
strategy.renderImage(renderInfo);
}
@Override
public String getResultantText()
{
return strategy.getResultantText();
}
final TextExtractionStrategy strategy;
final Field stringField;
}
(RemappingExtractionFilter.java)
It can be used like this:
String extractRemapped(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
TextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
return PdfTextExtractor.getTextFromPage(reader, pageNo, strategy);
}
(from RemappedExtraction.java)
Why does this work?
First of all, this is not the solution to all extraction problems, merely for extracting text from PDFs like the OP has presented.
This method works because the names the PDF uses in its fonts' encoding differences arrays can be interpreted even though they are not standard. These names are built as /Gxx where xx is the hexadecimal representation of the ASCII code of the character this name represents.
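To make the naming scheme concrete, here is a minimal stand-alone sketch of just the name decoding step, with no iText dependency. The class and method names (GlyphNameDecoder, decode, decodeAll) are hypothetical and only for illustration; the sketch assumes every name really has the /Gxx form described above and rejects anything else.

```java
// Hypothetical helper, not part of iText: decodes glyph names of the form "/Gxx".
final class GlyphNameDecoder {

    // Turns "/G41" into 'A' by reading the part after "/G" as a hexadecimal character code.
    static char decode(String glyphName) {
        if (glyphName == null || !glyphName.startsWith("/G") || glyphName.length() < 3) {
            throw new IllegalArgumentException("Not a /Gxx glyph name: " + glyphName);
        }
        // Throws NumberFormatException if the suffix is not valid hexadecimal.
        return (char) Integer.parseInt(glyphName.substring(2), 16);
    }

    // Decodes a sequence of glyph names into the text they spell.
    static String decodeAll(String... glyphNames) {
        StringBuilder builder = new StringBuilder();
        for (String name : glyphNames) {
            builder.append(decode(name));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // 0x41, 0x42, 0x53 are the ASCII codes of A, B and S
        System.out.println(decodeAll("/G41", "/G42", "/G53"));
    }
}
```

A PDF whose differences array uses any other naming convention would make decode throw, which is probably the safer failure mode for a workaround like this.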
A good test to find out whether or not a PDF allows text to be extracted correctly is to open it in Adobe Reader and copy and paste the text.
For instance: I copied the word ABSTRACT and I pasted it in Notepad++:
Do you see the word ABSTRACT in Notepad++? No, you see %&SOH'"%GS. The A is represented as %, the B is represented as &, and so on.
This is a clear indication that the content of the PDF isn't accessible: there is no mapping between the encoding that was used (% = A, & = B,...) and the actual characters that humans can understand.
In short: the PDF doesn't allow you to extract text, not with iText, not with iTextSharp, not with PDFBox. You'll have to find an OCR tool instead and OCR the complete document.
For more info, you may want to watch the following videos:
https://www.youtube.com/watch?v=4ur9WRWVrbM (~5 minutes)
https://www.youtube.com/watch?v=wxGEEv7ibHE (~15 minutes)
https://www.youtube.com/watch?v=g-QcU9B4qMc (~45 minutes)

Eclipse 4: disable native alt-f4 behavior

Does anyone know if it is possible to disable or override the native behavior of "Alt+F4" (which closes the application on Windows) in an e4 application?
What is the suggested solution to achieve this?
Best regards
My solution is NOT pure SWT solution. It only works on Windows. But you mentioned Windows, and if you only target one platform this is good enough. It uses internal code from SWT, but it maps to Windows API, documented by Microsoft, so it will not change.
public static void main(String[] args) {
final Display display = new Display();
final Shell shell = new Shell(display);
shell.addListener(SWT.Close, new Listener() {
@Override
public void handleEvent(Event event) {
if (OS.GetKeyState(OS.VK_MENU) < 0 && OS.GetKeyState(OS.VK_F4) < 0) {
event.doit = false;
}
}
});
shell.open();
while (!shell.isDisposed()) {
if (!display.readAndDispatch()) {
display.sleep();
}
}
display.dispose();
}
I found a solution, but I am not too happy with it.
I created an addon
that registers an event handler on the UIEvents.UILifeCycle.APP_STARTUP_COMPLETE topic.
The handler then retrieves the shell from the topic's metadata and registers a filter on the display.
@PostConstruct
void hookListeners() {
eventHandler = new EventHandler() {
@Override
public void handleEvent(Event arg0) {
MElementContainer property = (MElementContainer) arg0.getProperty("org.eclipse.e4.data");
final Shell shell = (Shell) property.getSelectedElement().getWidget();
final Display display = shell.getDisplay();
display.addFilter(SWT.Close, new Listener() {
@Override
public void handleEvent(org.eclipse.swt.widgets.Event event) {
if (!MessageDialog.openQuestion(shell, "Exit",
"Do you really want to close the Application?")) {
// see the API documentation of Display.addFilter
event.type = SWT.NONE;
event.doit = false;
}
}
});
}
};
eventBroker.subscribe(UIEvents.UILifeCycle.APP_STARTUP_COMPLETE, eventHandler);
}
This solution does not seem correct to me, so if anyone has a better one please share it :-)

GWT FileUpload - Servlet options and handling response

I am new to GWT and am trying to implement a file upload functionality.
Found some implementation help over the internet and used that as reference.
But have some questions related to that:
The actual upload or writing the contents of file on server(or disk) will be done by a servlet.
Is it necessary that this servlet (say MyFileUploadServlet) extends HttpServlet? OR
I can use RemoteServiceServlet or implement any other interface? If yes, which method do I need to implement/override?
In my servlet, after everything is done, I need to return the response to the client.
I think form.addSubmitCompleteHandler() can be used to achieve that. From servlet, I could return text/html (or String type object) and then use SubmitCompleteEvent.getResults() to get the result.
The question is: can I use my custom object instead of a String (let's say MyFileUploadResult), populate the results in it and then pass it back to the client?
or can I get back JSON object?
Currently, after getting back the response and using SubmitCompleteEvent.getResults(), I am getting some HTML tags added to the actual response such as :
<pre> Image upload successfully </pre> .
Is there a way to get rid of that?
Thanks a lot in advance!
Regards,
Ashish
To upload files, I have extended HttpServlet in the past. I used it together with Commons-FileUpload.
I made a general widget for form-based uploads. That was to accommodate uploads for different file types (plain text and Base64). If you just need to upload plain text files, you could combine the following two classes into one.
public class UploadFile extends Composite {
@UiField FormPanel uploadForm;
@UiField FileUpload fileUpload;
@UiField Button uploadButton;
interface Binder extends UiBinder<Widget, UploadFile> {}
public UploadFile() {
initWidget(GWT.<Binder> create(Binder.class).createAndBindUi(this));
fileUpload.setName("fileUpload");
uploadForm.setEncoding(FormPanel.ENCODING_MULTIPART);
uploadForm.setMethod(FormPanel.METHOD_POST);
uploadForm.addSubmitHandler(new SubmitHandler() {
@Override
public void onSubmit(SubmitEvent event) {
if ("".equals(fileUpload.getFilename())) {
Window.alert("No file selected");
event.cancel();
}
}
});
uploadButton.addClickHandler(new ClickHandler() {
@Override
public void onClick(ClickEvent event) {
uploadForm.submit();
}
});
}
public HandlerRegistration addCompletedCallback(
final AsyncCallback<String> callback) {
return uploadForm.addSubmitCompleteHandler(new SubmitCompleteHandler() {
@Override
public void onSubmitComplete(SubmitCompleteEvent event) {
callback.onSuccess(event.getResults());
}
});
}
}
The UiBinder part is pretty straightforward.
<g:HTMLPanel>
<g:HorizontalPanel>
<g:FormPanel ui:field="uploadForm">
<g:FileUpload ui:field="fileUpload"></g:FileUpload>
</g:FormPanel>
<g:Button ui:field="uploadButton">Upload File</g:Button>
</g:HorizontalPanel>
</g:HTMLPanel>
Now you can extend this class for plain text files. Just make sure your web.xml serves the HttpServlet at /textupload.
public class UploadFileAsText extends UploadFile {
public UploadFileAsText() {
uploadForm.setAction(GWT.getModuleBaseURL() + "textupload");
}
}
The servlet for plain text files goes on the server side. It returns the contents of the uploaded file to the client. Make sure to install the jar for FileUpload from Apache Commons somewhere on your classpath.
public class TextFileUploadServiceImpl extends HttpServlet {
private static final long serialVersionUID = 1L;
@Override
protected void doPost(HttpServletRequest request,
HttpServletResponse response) throws ServletException, IOException {
if (! ServletFileUpload.isMultipartContent(request)) {
response.sendError(HttpServletResponse.SC_BAD_REQUEST,
"Not a multipart request");
return;
}
ServletFileUpload upload = new ServletFileUpload(); // from Commons
try {
FileItemIterator iter = upload.getItemIterator(request);
if (iter.hasNext()) {
FileItemStream fileItem = iter.next();
// String name = fileItem.getFieldName(); // form field name, if you need it (getName() gives the file name)
// Set headers before writing the body; once the response is committed they are ignored
response.setContentType("text/html");
response.setBufferSize(32768);
ServletOutputStream out = response.getOutputStream();
int bufSize = response.getBufferSize();
byte[] buffer = new byte[bufSize];
InputStream in = fileItem.openStream();
BufferedInputStream bis = new BufferedInputStream(in, bufSize);
int bytes;
// Echo the uploaded file back to the client chunk by chunk
while ((bytes = bis.read(buffer, 0, bufSize)) >= 0) {
out.write(buffer, 0, bytes);
}
// The content length is unknown until the stream has been read, so it is left to the container
bis.close();
in.close();
out.flush();
out.close();
}
} catch(Exception caught) {
throw new RuntimeException(caught);
}
}
}
I cannot recall how I got around the <pre></pre> tag problem. You may have to filter the tags on the client. The topic is also addressed here.
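One way to filter the tags on the client is a small string helper applied to the value of SubmitCompleteEvent.getResults(). The following is only a sketch: the class and method names (UploadResults, stripPreWrapper) are made up for illustration, and it assumes the browser wraps the whole response in a single <pre> element, possibly with attributes. It uses nothing but basic String methods, which I would expect GWT's client-side emulation to support, but that is an assumption worth checking.

```java
// Hypothetical client-side helper: removes a single surrounding <pre ...>...</pre> wrapper.
final class UploadResults {

    private static final String CLOSE_TAG = "</pre>";

    static String stripPreWrapper(String results) {
        if (results == null) {
            return null;
        }
        String s = results.trim();
        String lower = s.toLowerCase();
        if (lower.startsWith("<pre") && lower.endsWith(CLOSE_TAG)) {
            // End of the opening tag, which may carry attributes such as style="..."
            int open = s.indexOf('>');
            int closeStart = s.length() - CLOSE_TAG.length();
            if (open >= 0 && open < closeStart) {
                return s.substring(open + 1, closeStart).trim();
            }
        }
        // Not wrapped as expected: hand back the trimmed input unchanged
        return s;
    }
}
```

In the addCompletedCallback method above, this would mean passing stripPreWrapper(event.getResults()) to the callback instead of the raw result. Browsers may also escape characters inside the wrapper as HTML entities, which this sketch does not undo.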

Eclipse Plugin, How can I know when IResourceDeltaVisitor ends processing tree nodes?

I wrote an Eclipse plugin that basically allows a programmer to select a Java source file in the Project Explorer and, by choosing the corresponding drop-down menu option, create an interface .java file based on the selected one.
Everything works fine, but now I need to program the update part of the job.
The update requirement is simple: I need to listen for changes, identify when a source that has a generated interface has been modified, and recreate the interface file.
To do this I wrote a class that implements IResourceChangeListener interface.
That class looks like:
public class DTOChangeListener implements IResourceChangeListener {
private List<UpdatedUnit> updatedUnits;
public DTOChangeListener() {
super();
this.updatedUnits=new ArrayList<UpdatedUnit>();
}
@Override
public void resourceChanged(IResourceChangeEvent event) {
try{
if(event.getType() == IResourceChangeEvent.POST_CHANGE){
event.getDelta().accept(this.buildVisitor());
}
}catch(CoreException ex){
ex.printStackTrace();
}
}
protected IResourceDeltaVisitor buildVisitor(){
IResourceDeltaVisitor result=new IResourceDeltaVisitor() {
@Override
public boolean visit(IResourceDelta resDelta) throws CoreException {
String resName=resDelta.getResource().getName();
if(resName==null || resName.equals("")){
return true;
}
String[] splits=resName.split("\\.");
String name = splits[0];
if(name.contains("PropertyAccess")){
return false;
}
String interfaceName=name + "PropertyAccess";
String interfaceFile=interfaceName + ".java";
IResource res=resDelta.getResource();
if((res instanceof IFolder) || (res instanceof IProject)){
// Avoid Folder & Project Nodes
return true;
}
IProject project=res.getProject();
if(project!=null){
if(project.isNatureEnabled("org.eclipse.jdt.core.javanature")){
IJavaElement element=JavaCore.create(res);
if(element instanceof ICompilationUnit){
ICompilationUnit unit=(ICompilationUnit)element;
IPath path=res.getProjectRelativePath().removeLastSegments(1);
IResource propertyAccess=project.findMember(path.append(interfaceFile));
if(propertyAccess!=null){
UpdatedUnit updatedUnit=new UpdatedUnit(project, path, unit);
updatedUnits.add(updatedUnit);
return false;
}
}
}
}
return true;
}
};
return result;
}
public List<UpdatedUnit> getUpdatedUnits() {
return updatedUnits;
}
}
I add the listener to the workspace. Now my question is:
How can I know when the updatedUnits list is complete, so that I can process it with my own code?
One possible answer to this question would be: don't worry, the call
event.getDelta().accept(this.buildVisitor());
will block until the visitor has finished processing.
But that behavior is not documented, at least not explicitly.
Any ideas would be appreciated.
Thanks in advance.
Daniel
Unless it's documented to not block, it blocks.
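To illustrate what blocking means here, the sketch below models the visitor pattern with plain Java stand-ins. Node, Visitor and collectJavaFiles are hypothetical names and not Eclipse API; the sketch assumes, as the answer above does, that accept() walks the tree on the calling thread. Under that assumption the collected list is complete as soon as accept() returns, so the processing code can simply follow the accept() call inside resourceChanged().

```java
import java.util.ArrayList;
import java.util.List;

// Plain Java model of a synchronous tree visitor (stand-in for IResourceDelta / IResourceDeltaVisitor).
final class SyncVisitDemo {

    interface Visitor {
        // Return true to descend into the children of the visited node
        boolean visit(Node node);
    }

    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        // Depth-first traversal on the calling thread; returns only after every node was offered
        void accept(Visitor visitor) {
            if (visitor.visit(this)) {
                for (Node child : children) {
                    child.accept(visitor);
                }
            }
        }
    }

    static List<String> collectJavaFiles(Node root) {
        List<String> found = new ArrayList<>();
        root.accept(node -> {
            if (node.name.endsWith(".java")) {
                found.add(node.name);
                return false; // nothing to visit below a file
            }
            return true;
        });
        // accept() has returned, so the list is complete and safe to process here
        return found;
    }

    public static void main(String[] args) {
        Node root = new Node("project")
                .add(new Node("src").add(new Node("A.java")).add(new Node("B.java")))
                .add(new Node("README"));
        System.out.println(collectJavaFiles(root));
    }
}
```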

Are there any patterns for component versioning and backwards-compatibility using Windsor?

I have to support a new input file format in a system which uses Windsor. I also need to support the old version of the input file during a transition phase.
This will probably be repeated in future, and we'll again need to support the new and the next most recent format.
The import processing is handled by a component, and the new version has had significant improvements in the code which makes it lots more efficient compared to the old version. So what I'd like to do is to have the new component and the old component in the system, and dynamically use the new or the old component based upon the file metadata.
Is there a pattern for this type of scenario anyone can suggest?
The fact that you're using Windsor is pretty much irrelevant here. Always strive to find a container-independent solution. Here's one:
interface IImportProcessor {
bool CanHandleVersion(int version);
Stream Import(Stream input);
}
class ImportProcessorVersion1 : IImportProcessor {
public bool CanHandleVersion(int version) {
return version == 1;
}
public Stream Import(Stream input) {
// do stuff
return input;
}
}
class ImportProcessorVersion2 : IImportProcessor {
public bool CanHandleVersion(int version) {
return version == 2;
}
public Stream Import(Stream input) {
// do stuff
return input;
}
}
class MainImportProcessor: IImportProcessor {
private readonly IImportProcessor[] versionSpecificProcessors;
public MainImportProcessor(IImportProcessor[] versionSpecificProcessors) {
this.versionSpecificProcessors = versionSpecificProcessors;
}
public bool CanHandleVersion(int version) {
return versionSpecificProcessors.Any(p => p.CanHandleVersion(version));
}
private int FetchVersion(Stream input) {
// do stuff
return 1;
}
public Stream Import(Stream input) {
int version = FetchVersion(input);
var processor = versionSpecificProcessors.FirstOrDefault(p => p.CanHandleVersion(version));
if (processor == null)
throw new Exception("Unsupported version " + version);
return processor.Import(input);
}
}
Your app would take a dependency on IImportProcessor. The container is wired so that the default implementation of this interface is MainImportProcessor. The container is also wired so that MainImportProcessor gets all other implementations of IImportProcessor.
This way you can add implementations of IImportProcessor and each will be selected when appropriate.
It might be easier to wire things up if MainImportProcessor implements an interface different from IImportProcessor.
Another possibility could be implementing a chain of responsibility.
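A chain of responsibility could look roughly like the following sketch. It is written in Java rather than C# and is kept container-independent; all names (ImportHandler, V1Handler, V2Handler) are hypothetical, and a String stands in for the input stream to keep the example short. Each handler either processes its own version or passes the request on to the next link.

```java
// Hypothetical chain-of-responsibility sketch for version-specific import handlers.
final class ImportChainDemo {

    // One link in the chain: handles its own version or delegates to the next link
    static abstract class ImportHandler {
        private final ImportHandler next;

        ImportHandler(ImportHandler next) {
            this.next = next;
        }

        final String handle(int version, String input) {
            if (canHandle(version)) {
                return process(input);
            }
            if (next == null) {
                throw new IllegalArgumentException("Unsupported version " + version);
            }
            return next.handle(version, input);
        }

        abstract boolean canHandle(int version);

        abstract String process(String input);
    }

    static final class V1Handler extends ImportHandler {
        V1Handler(ImportHandler next) { super(next); }
        @Override boolean canHandle(int version) { return version == 1; }
        @Override String process(String input) { return "v1:" + input; }
    }

    static final class V2Handler extends ImportHandler {
        V2Handler(ImportHandler next) { super(next); }
        @Override boolean canHandle(int version) { return version == 2; }
        @Override String process(String input) { return "v2:" + input; }
    }

    public static void main(String[] args) {
        // Newest format first, falling back to the previous one
        ImportHandler chain = new V2Handler(new V1Handler(null));
        System.out.println(chain.handle(1, "data"));
        System.out.println(chain.handle(2, "data"));
    }
}
```

Compared with the MainImportProcessor approach, the ordering of the handlers is explicit in how the chain is built, at the cost of each handler knowing about its successor.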