Get font of recognized character with Tesseract-OCR

Get font of recognized character with Tesseract-OCR - tesseract

Is it possible to get the font of the recognized characters with Tesseract-OCR, i.e. are they Arial or Times New Roman, either from the command-line or using the API.
I'm scanning documents that might have different parts with different fonts, and it would be useful to have this information.

Tesseract has an API WordFontAttributes function defined in ResultIterator class that you can use.

Based on nguyenq's answer i wrote a simple python script that prints the font name for each detected char. This script uses the python lib tesserocr.
from tesserocr import PyTessBaseAPI, RIL, iterate_level
def get_font(image_path):
with PyTessBaseAPI() as api:
api.SetImageFile(image_path)
api.Recognize()
ri = api.GetIterator()
level = RIL.SYMBOL
for r in iterate_level(ri, level):
symbol = r.GetUTF8Text(level)
word_attributes = r.WordFontAttributes()
if symbol:
print(u'symbol {}, font: {}'.format(symbol, word_attributes['font_name']))
get_font('logo.jpg')

Related

What is an example for non unicode character set for -Dfile.encoding=?

I have a JVM. where character set as "-Dfile.encoding=UTF-8" . This is how UTF-8 is set. I would want to set it to a non Unicode character set.
Is there an example/value for non unicode character set so that I can set to -Dfile.encoding= ?

[ TLDR => Application encoding a confusing issue, but this document from Oracle should help. ]
First a few important general points about specifying the encoding by setting the System Property file.encoding at run time:
It's use is not formally supported, and never has been. From a Java Bug Report in 1998:
The "file.encoding" property is not required by the J2SE platform
specification; it's an internal detail of Sun's implementations and
should not be examined or modified by user code. It's also intended
to be read-only; it's technically impossible to support the setting of
this property to arbitrary values on the command line or at any other
time during program execution.
There is a draft JEP (JDK Enhancement Proposal), JDK-8187041
Use UTF-8 as default Charset, which proposes:
Use UTF-8 as the Java virtual machine's default charset so that APIs
that depend on the default charset behave consistently across all
platforms.
It doesn't necessarily make sense to claim that "This application uses encoding {x}" since there may be multiple encodings associated with an application, which can be addressed in different ways, including:
The file encoding for console output.
The file encoding of the application's source files.
The file encoding(s) for file I/O.
The file encoding of file paths.
All that said, Oracle specify all encodings supported by Java SE 8. I can't find a corresponding document for more recent JDK versions. Note that:
Encodings can be environment specific, based on locale, operating system, Java version, etc.
Almost every encoding has at least one alias. For example, the encoding name for simplified Chinese is GBK, but you could also use CP936 or windows-936.
Most encoding are non Unicode since Unicode encoding names contain the string "UTF".
An encoding name can vary depending on how the application is processing files (java.nio APIs vs. java.io/java.lang APIs.). For example, if performing some I/O on Turkish files on Windows:
If the java.nio.* classes are used, specify -Dfile.encoding=windows-1254 at runtime.
If the java.lang.* & java.io.* classes are used, specify -Dfile.encoding=Cp1254 at runtime.
This DZone article provides a useful piece of code to show how setting -Dfile.encoding at runtime can impact various settings:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Locale;
import static java.lang.System.out;
/**
* Demonstrate default Charset-related details.
*/
public class CharsetDemo
{
/**
* Supplies the default encoding without using Charset.defaultCharset()
* and without accessing System.getProperty("file.encoding").
*
* #return Default encoding (default charset).
*/
public static String getEncoding()
{
final byte [] bytes = {'D'};
final InputStream inputStream = new ByteArrayInputStream(bytes);
final InputStreamReader reader = new InputStreamReader(inputStream);
final String encoding = reader.getEncoding();
return encoding;
}
public static void main(final String[] arguments)
{
out.println("Default Locale: " + Locale.getDefault());
out.println("Default Charset: " + Charset.defaultCharset());
out.println("file.encoding; " + System.getProperty("file.encoding"));
out.println("sun.jnu.encoding: " + System.getProperty("sun.jnu.encoding"));
out.println("Default Encoding: " + getEncoding());
}
}
Here's some sample output when specifying -Dfile.encoding=860 (an alias for MS-DOS Portuguese) using Java 12 on Windows 10:
run:
Default Locale: en_US
Default Charset: IBM860
file.encoding: 860
sun.jnu.encoding: Cp1252
Default Encoding: Cp860
BUILD SUCCESSFUL (total time: 0 seconds)
Test the encoding you plan to specify at run time on all target platforms. You may get unexpected results. For example, when I run the code above on Windows 10 with -Dfile.encoding=IBM864 (PC Arabic) it works, but fails with -Dfile.encoding=IBM420 (IBM Arabic).

Python w/QT Creator form - Possible to grab multiple values?

I'm surprised to not find a previous question about this, but I did give an honest try before posting.
I've created a ui with Qt Creator which contains quite a few QtWidgets of type QLineEdit, QTextEdit, and QCheckbox. I've used pyuic5 to convert to a .py file for use in a small python app. I've successfully got the form connected and working, but this is my first time using python with forms.
I'm searching to see if there is a built-in function or object that would allow me to pull the ObjectNames and Values of all widgets contained within the GUI form and store them in a dictionary with associated keys:values, because I need to send off the information for post-processing.
I guess something like this would work manually:
...
dict = []
dict['checkboxName1'] = self.checkboxName1.isChecked()
dict['checkboxName2'] = self.checkboxName2.isChecked()
dict['checkboxName3'] = self.checkboxName3.isChecked()
dict['checkboxName4'] = self.checkboxName4.isChecked()
dict['lineEditName1'] = self.lineEditName1.text()
... and on and on
But is there a way to grab all the objects and loop through them, even if each different type (i.e. checkboxes, lineedits, etc) needs to be done separately?
I hope I've explained that clearly.
Thank you.

Finally got it working. Couldn't find a python specific example anywhere, so through trial and error this worked perfectly. I'm including the entire working code of a .py file that can generate a list of all QCheckBox objectNames on a properly referenced form.
I named my form main_form.ui from within Qt Creator. I then converted it into a .py file with pyuic5
pyuic5 main_form.ui -o main_form.py
This is the contents of a sandbox.py file:
from PyQt5 import QtCore, QtGui, QtWidgets
import sys
import main_form
# the name of my Qt Creator .ui form converted to main_form.py with pyuic5
# pyuic5 original_form_name_in_creator.ui -o main_form.py
class MainApp(QtWidgets.QMainWindow, main_form.Ui_MainWindow):
def __init__(self):
super(self.__class__, self).__init__()
self.setupUi(self)
# Push button object on main_form named btn_test
self.btn_test.clicked.connect(self.runTest)
def runTest(self):
# I believe this creates a List of all QCheckBox objects on entire UI page
c = self.findChildren(QtWidgets.QCheckBox)
# This is just to show how to access objectName property as an example
for box in c:
print(box.objectName())
def main():
app = QtWidgets.QApplication(sys.argv) # A new instance of QApplication
form = MainApp() # We set the form to be our ExampleApp (design)
form.show() # Show the form
app.exec_() # and execute the app
if __name__ == '__main__': # if we're running file directly and not importing it
main() # run the main function

See QObject::findChildren()
In C++ the template argument would allow one to specify which type of widget to retrieve, e.g. to just retrieve the QLineEdit objects, but I don't know if or how that is mapped into Python.
Might need to retrieve all types and then switch handling while iterating over the resulting list.

Do Flask/Jinja2 have a provision to save rendered templates during debugging?

While debugging it's useful to view the rendered HTML and JS templates through a "view source" menu item in a browser, but doing so forces one to use the UI of the browser.
Does Jinja2 (or Flask) provide a facility to save the last n rendered templates on the server? It would then be possible to use one's favorite editor to view the rendered files, along with using one's familiar font-locking and search facilities.
It's of course possible to implement such a facility by hand, but doing so smacks too much like peppering one's programs while debugging with print statements, an approach that doesn't scale. I'm seeking a better alternative.

I'd think the easiest thing to do would be to use the after_request hook.
from flask import g
#main.route('/')
def index():
models = Model.query.all()
g.template = 'index'
return render_template('index.html', models=models)
#main.after_request
def store_template(response):
if hasattr(g, 'template'):
with open('debug/{0}-{1}'.format(datetime.now(), g.template), 'w') as f:
f.write(response.data)
return response
Here are the docs.
http://flask.pocoo.org/snippets/53/
As far as only collecting the last n templates I'd likely setup a cron job to do that. Here is an example
import os
from datetime import datetime
def make_files(n):
text = '''
<html>
</html>
'''
for a in range(n):
with open('debug/index-{0}.html'.format(datetime.now()), 'w') as f:
f.write(text)
def get_files(dir):
return [file for file in os.listdir(dir) if file.endswith('.html')]
def delete_files(dir, files, amount_kept):
rev = files[::-1]
for file in rev[amount_kept:]:
loc = dir + '/' + file
os.remove(loc)
if __name__ == '__main__':
make_files(7)
files = get_files('debug')
print files
delete_files('debug', files, 5)
files = get_files('debug')
print files
EDIT
reversed order of files inside the delete function so it will keep the most recent files. Also unable to find a way of accessing the original template name to avoid hardcoding.
EDIT 2
Alright so updated it to show how you can use flask.g to pass the template name to the after_request function
docs http://flask.pocoo.org/docs/0.11/testing/#faking-resources

Gtk.stock is deprecated, what's the alternative?

I've been learning to develop to Gtk and most of the examples online suggests the use of Gtk.stock icons. However, its use produces warnings that it has been deprecated and I can't find the alternative to these icons.
Code examples are:
open_button:Gtk.ToolButton = new ToolButton.from_stock(Stock.OPEN)
open_button.clicked.connect (openfile)
new_button:Gtk.ToolButton = new ToolButton.from_stock(Stock.NEW)
new_button.clicked.connect (createNew)
save_button:Gtk.ToolButton = new ToolButton.from_stock(Stock.SAVE)
save_button.clicked.connect (saveFile)
That generates error as:
/tmp/text_editor-exercise_7_1.vala.c:258:2: warning: 'GtkStock' is deprecated [-Wdeprecated-declarations]
_tmp1_ = (GtkToolButton*) gtk_tool_button_new_from_stock (GTK_STOCK_OPEN);
Which is the alternative and how it would look in the code above?

GTK+3 has moved over to the freedesktop.org Icon Naming Specification and internationalised labels. Taking Gtk.Stock.OPEN as an example. The GNOME Developer documentation for GTK_STOCK_OPEN gives two replacements:
GTK_STOCK_OPEN has been deprecated since version 3.10 and should not be used in newly-written code. Use named icon "document-open" or the label "_Open".
The Named Icon Method
The named icon method would be something like:
var open_icon = new Gtk.Image.from_icon_name( "document-open",
IconSize.SMALL_TOOLBAR
)
var open_button = new Gtk.ToolButton( open_icon, null )
The Label Method
The label method makes use of gettext to translate the label in to the current runtime language of the program. This is indicated by the underscore before the label. The line in your program would be:
var open_button = new Gtk.ToolButton( null, dgettext( "gtk30", "_Open") )
gettext uses domains, which are files containing the translations. The Gtk+3 domain is gtk30. You will also need to add a line at the beginning of your program to change the default locale for the C language, which is US English ASCII, to the locale of the run time environment:
init
Intl.setlocale()
To compile the Genie program you will need to set the default domain for gettext. This is usually set to nothing:
valac -X -DGETTEXT_PACKAGE --pkg gtk+-3.0 my_program.gs
When you run your program you will get the "_Open" translated to your locale. You can also change the locale. If you have the French locale installed then running the program with:
LC_ALL=fr ./my_program
will have the "_Open" label appear in French.
You may see in examples _( "_OPEN" ). The _() is a function like dgettext but uses a default domain. You may want to keep the default domain to the translation file for your own program. Using _( "_translate me" ) is a bit less typing that dgettext( "mydomain", "_translate me" ). To set the default domain in Genie add a line before init:
const GETTEXT_PACKAGE:string = "mydomain"
init
Intl.setlocale()

Executing Python Code in IPython Kernel

I want to duplicate ipython notebook capability in Emacs / Pymacs; and I need some direction for a simple code that can 1) send python / "magics" code to a ipython kernel 2) receive the display output, as a string. I found this comment by minrk, the "ipython kernel" example did not work, it gave "ImportError: No module named zmq.blockingkernelmanager".
I had better luck with one his other pointers, finally I landed at ipython-1.1.0/IPython/kernel/inprocess/tests/test_kernel.py, I ripped out a minimal part, and coded an Emacs extension called pytexipy-notebook. It's on Github
goo.gl/kQzJW1
If anyone knows of better examples, such as connecting to an existing (out of process), I'd like to hear about these.
Thanks in advance,

Here is a sample for ipython 3.0.
from IPython.testing.globalipapp import get_ipython
from IPython.utils.io import capture_output
ip = get_ipython()
def run_cell(cmd):
with capture_output() as io:
res = ip.run_cell(content)
print 'suc', res.success
print 'res', res.result
res_out = io.stdout
print 'res out', res_out
content = "print (111+222)"
run_cell(content)
content = "alsdkjflajksf"
run_cell(content)
I will soon update
https://github.com/burakbayramli/emacs-ipython

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Get font of recognized character with Tesseract-OCR - tesseract

Is it possible to get the font of the recognized characters with Tesseract-OCR, i.e. are they Arial or Times New Roman, either from the command-line or using the API. I'm scanning documents that might have different parts with different fonts, and it would be useful to have this information.

Tesseract has an API WordFontAttributes function defined in ResultIterator class that you can use.

Related

What is an example for non unicode character set for -Dfile.encoding=?

Python w/QT Creator form - Possible to grab multiple values?

Do Flask/Jinja2 have a provision to save rendered templates during debugging?

Gtk.stock is deprecated, what's the alternative?

Executing Python Code in IPython Kernel

Categories

Resources