How to detect language or script from an input image using Python or Tesseract OCR? - tesseract

Given an input image which can be in any language or writing system, how do I detect what script the text in the picture uses?
Any Python-based or Tesseract-OCR based solution would be appreciated.
Note that script here means writing systems like Latin, Cyrillic, Devanagari, etc., for corresponding languages like English, Russian, Hindi, etc. (respectively)

Pre-requisites:
Install Tesseract: sudo apt install tesseract-ocr tesseract-ocr-all
Install PyTessract: pip install pytesseract
Script-Detection:
import pytesseract
import re
def detect_image_lang(img_path):
try:
osd = pytesseract.image_to_osd(img_path)
script = re.search("Script: ([a-zA-Z]+)\n", osd).group(1)
conf = re.search("Script confidence: (\d+\.?(\d+)?)", osd).group(1)
return script, float(conf)
except e:
return None, 0.0
script_name, confidence = detect_image_lang("image.png")
Language-Detection:
After performing OCR (using Tesseract), pass the text through langdetect library (or any other lib).

Related

How can I run pytesseract / tesseract in Foundry Code Repositories?

I am trying to use the function image_to_string from the library pytesseract in a repository to perform OCR of PDFs. However, I am getting the following error:
From the checks I would assume the library was loaded correctly:
Does anyone have an idea how to trouble shoot here?
It seems like Foundry is not respecting / running the environment activation script
https://github.com/conda-forge/tesseract-feedstock/blob/main/recipe/activate.sh
that sets the TESSDATA_PREFIX environment variable automatically. However, we can infer the value manually and provide it to the pytesseract API calls.
Define the following helper function:
def _get_tessdata_directory_path():
import sys
from pathlib import Path
env_root = Path(sys.executable).parent.parent
share_dir = env_root / 'share' / 'tessdata'
assert share_dir.exists(), 'tessdata directory does not exist in <envroot>/share/tessdata'
return str(share_dir)
and use it like shown in the following snippet:
tessdata_dir_config = f'--tessdata-dir "{_get_tessdata_directory_path()}"'
pytesseract.image_to_string(image, ..., config=tessdata_dir_config)

ModuleNotFoundError: No module named 'gspread' on python anywhere

What I am trying to achieve.
Run a python script saved On pythonanywhere host from google sheets on a button press.
Check the answer by Dustin Michels
Task of Each File?
app.py: contains code of REST API made using Flask.
runMe.py: contains code for that get values from(google sheet cell A1:A2). And sum both values send sum back to A3.
main.py: contains code for a GET request with an argument as name(runMe.py).filename may change if the user wants to run another file.
I Made an API by using Flask.it works online and offline perfectly but still, if you want to recommend anything related to the app.py.Code Review App.py
from flask import Flask, jsonify
from flask_restful import Api, Resource
import os
app = Flask(__name__)
api = Api(app)
class callApi(Resource):
def get(self, file_name):
my_dir = os.path.dirname(__file__)
file_path = os.path.join(my_dir, file_name)
file = open(file_path)
getvalues = {}
exec(file.read(), getvalues)
return jsonify({'data': getvalues['total']})
api.add_resource(callApi, "/callApi/<string:file_name>")
if __name__ == '__main__':
app.run()
Here is the Code of runMe2.py
import gspread
from oauth2client.service_account import ServiceAccountCredentials
# use creds to create a client to interact with the Google Drive API
scopes =['https://www.googleapis.com/auth/spreadsheets',"https://www.googleapis.com/auth/drive.file","https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name('service_account.json', scopes)
client = gspread.authorize(creds)
# Find a workbook by name and open the first sheet
# Make sure you use the right name here.
sheet = client.open("Demosheet").sheet1
# Extract and print all of the values
list_of_hashes = sheet.get_all_records()
print(list_of_hashes)
below is the main.py code
import requests
BASE = 'https://username.pythonanywhere.com/callApi/test.py'
response = requests.get(BASE)
print(response.json())
main.py output
{'data': 54}
Test.py code
a = 20
b = 34
total = a+b
print(total)
PROBLEM IS
if I request runMe2.py at that time I am got this error.
check runMe2.py code above
app.py is hosted on https://www.pythonanywhere.com/
ModuleNotFoundError: No module named 'gspread'
However, I installed gspread on pythonanywhere why using the command. but it's not working.
You either haven't installed the gspread package on your current python environment or it is installed somewhere (e.g. in a diff. virtual env) and your script cant find it.
Try installing the package inside the environment your running your script in using pip3:
pip3 install gspread
You can try something like this on Windows
pip install gspread
or on Mac
pip3 install gspread
If you're running on Docker, or building with a requirements.txt you can try adding this line you your requirements.txt file
gspread==3.7.0
Any other instructions for this package can be found here => https://github.com/burnash/gspread
Download gspread here
Download the tar file: gspread-3.7.0.tar.gz from the above link
Extract file and convert folder in zip then upload it back on server
Open bash console and use command as
$ unzip gspread-3.7.0
$ cd gspread-3.7.0
$ python3.7 setup.py install --user

MATLAB — Unable to Import cv2 Library

I'm a beginner at OpenCV, and trying to run an open-source program.
http://asrl.utias.utoronto.ca/code/gpusurf/index.html
I currently have the Computer Vision Toolbox OpenCV Interface 20.1.0 installed and Computer Vision Toolbox 9.2.
I cannot run this simple open-source feature matching algorithm without encountering errors.
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
% read images
img1 = cv2.imread('[INSERT PATH #1]');
img2 = cv2.imread('[INSERT PATH #2]');
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY);
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY);
%sift
sift = cv2.xfeatures2d.SIFT_create();
keypoints_1, descriptors_1 = sift.detectAndCompute(img1,None);
keypoints_2, descriptors_2 = sift.detectAndCompute(img2,None);
len(keypoints_1), len(keypoints_2)
The following message is returned:
Error: File: Keypoints.m Line: 1 Column: 8
The import statement 'import cv2' cannot be found or cannot be imported. Imported names must end with '.*' or be
fully qualified.
However, when I remove Line 1, I instead get the following error.
Error: File: Keypoints.m Line: 2 Column: 8
The import statement 'import matplotlib.pyplot' cannot be found or cannot be imported. Imported names must end
with '.*' or be fully qualified.
Finally, following the error message only results in a sequence of further errors from the cv2 library. Any ideas?
That's because the code you've used isn't MATLAB code, it's python code.
As per the website you've linked:
From within Matlab
The parallel implementation coded in Matlab can be run by using the surf_find_keypoints() function. The output keypoints can be sorted by strength using surf_best_n_keypoints(), and plotted using surf_plot_keypoints().
Check that you've downloaded the correct files and try again.
Furthermore, the Matlab OpenCV Interface is designed to integrate C++ OpenCV code, not python. Documentations here.
Yes, it is correct that this is Python code. I would recommend checking your dependencies/libraries. The PyCharm IDE is what I personally use since it takes care of all the libraries easily.
If you do end up trying out PyCharm click on the red icon when hovering on CV2. It’ll then give you a prompt to download the library.
Edit:
Using Python some setup can be done. Using pip:
Install opencv-python
pip install opencv-python
Install opencv-contrib-python
pip install opencv-contrib-python
Unfortunately, there is some issue with the sift feature since by default it is excluded from newer free versions of OpenCV.
sift = cv2.xfeatures2d.SIFT_create() not working even though have contrib installed
import cv2
Image_1 = cv2.imread("Image_1.png", cv2.IMREAD_COLOR)
Image_2 = cv2.imread("Image_2.jpg", cv2.IMREAD_COLOR)
Image_1 = cv2.cvtColor(Image_1, cv2.COLOR_BGR2GRAY)
Image_2 = cv2.cvtColor(Image_2, cv2.COLOR_BGR2GRAY)
sift = cv2.SIFT_create()
keypoints_1, descriptors_1 = sift.detectAndCompute(Image_1,None)
keypoints_2, descriptors_2 = sift.detectAndCompute(Image_2,None)
len(keypoints_1), len(keypoints_2)
The error I received:
"/Users/michael/Documents/PYTHON/Test Folder/venv/bin/python" "/Users/michael/Documents/PYTHON/Test Folder/Testing.py"
Traceback (most recent call last):
File "/Users/michael/Documents/PYTHON/Test Folder/Testing.py", line 9, in <module>
sift = cv2.SIFT_create()
AttributeError: module 'cv2.cv2' has no attribute 'SIFT_create'
Process finished with exit code 1

How to import modules in IPython Clusters

I am trying to import some of my personal modules into my IPython Clusters. I am using Anacondas on Windows Vista 64 bit
from IPython.parallel import Client
rc = Client()
dview = rc[:]
with dview.sync_imports():
import lib.rf
It is giving me this error:
No module named 'lib.rf'
I can import the module in the rest of my IPython notebook, as I have this .bat file to start ipython notebook:
cd C:\Users\Jon\workspace\bf
set PYTHONPATH=%PYTHONPATH%;C:\Users\Jon\workspace\bf
C:\Anaconda\envs\p33\scripts\ipython notebook
I am using this similar code to start my ip clusters:
cd C:\Users\Jon\workspace\bf
set PYTHONPATH=%PYTHONPATH%;C:\Users\Jon\workspace\bf
C:\Anaconda\envs\p33\Scripts\ipcluster start --n=7
Why is this not working?
More info:
If I print out sys.path, I get a list that contains C:\Users\Jon\workspace\bf
If I print out the paths of my clusters, I get the same list:
%px sys.path
['',
'',
'',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\distribute-0.6.28-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\pykalman-0.9.5-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\patsy-0.2.1-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\joblib-0.8.3_r1-py3.3.egg',
'C:\\Users\\Jon\\workspace\\bf',
'C:\\Users\\Jon\\workspace\\bf\\my_numba',
'C:\\Anaconda\\envs\\p33\\python33.zip',
'C:\\Anaconda\\envs\\p33\\DLLs',
'C:\\Anaconda\\envs\\p33\\lib',
'C:\\Anaconda\\envs\\p33',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\Sphinx-1.2.3-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32\\lib',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\Pythonwin',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\runipy-0.1.1-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\setuptools-7.0-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\IPython\\extensions']
In [45]:
Further analysis:
%px lib.__path__
Out[0:11]: _NamespacePath(['C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32\\lib'])
lib.__path__
Out[57]: ['.\\lib']
Looks like the ipcluster and notebook are looking at lib in different places. I have tried renaming lib to mylib. It has not helped.
It seems that with dview.sync_imports() is being run someplace other than your IPython Notebook environment and is therefore relying a different PYTHONPATH. It is definitely not being run on one of the cluster engines and so wouldn't expect it to leverage your cluster settings of PYTHONPATH.
I'm thinking you'll need to have that directory in your PYTHONPATH (not your PATH) for the calling python environment because that is the location from which you are importing the modules.
The impact of the bit you have about setting the PYTHONPATH in the DOS shell from which you invoke ipclusters isn't clear to me. I can see that one might expect this to let the engines know about your directory, but I'm wondering if that PYTHONPATH gets initilized to the environment from which you call IPython.parallel.Client.

Pyinstaller --onefile warning pyconfig.h when importing scipy or scipy.signal

This is very simple to recreate.
If my script foo.py is:
import scipy
Then run:
python pyinstaller.py --onefile foo.py
When I launch foo.exe I get:
WARNING: file already exists but should not: C:\Users\username\AppData\Local\Temp\_MEI86402\Include\pyconfig.h
I've tested a few versions but the latest I've confirmed is 2.1dev-e958e02 running on Win7, Python 2.7.5 (32 bit), Scipy version 0.12.0
I've submitted a ticket with the Pyinstaller folks but haven't heard anything yet. Any clues how to debug this further?
You can hack the spec file to remove the second instance by adding these lines after a=Analysis:
for d in a.datas:
if 'pyconfig' in d[0]:
a.datas.remove(d)
break
The answer by wtobia# worked for me. See https://github.com/pyinstaller/pyinstaller/issues/783
Go to C:\Python27\Lib\site-packages\PyInstaller\build.py
Find the def append(self, tpl): function.
Change if tpl[2] == "BINARY": to if tpl[2] in ["BINARY", "DATA"]:
Expanding upon Ilya's solution, I think this is a little bit more robust solution to modifying the spec file (again place after the a=Analysis... statement).
a.datas = list({tuple(map(str.upper, t)) for t in a.datas})
I only tested this on a small test program (one with a single import and print statement), but it seems to work. a.datas is a list of tuples of strings which contain the pyconfig.h paths. I convert them all to lowercase and then dedup. I actually found that converting all of them all to lowercase was sufficient to get it to work, which suggests to me that pyinstaller does case-sensitive deduping when it should be case-insensitive on Windows. However, I did the deduping myself for good measure.
I realized that the problem is that Windows is case-insensitive and these 2 statements are source directories are "duplicates:
include\pyconfig.h
Include\pyconfig.h
My solution is to manually tweak the .spec file with after the a=Analysis() call:
import platform
if platform.system().find("Windows")>= 0:
a.datas = [i for i in a.datas if i[0].find('Include') < 0]
This worked in my 2 tests.
A more flexible solution would be to check ALL items for case-insensitive collisions.
I ran the archive_viewer.py utility (from PyInstaller) on one of my own --onefile executables that has the same error and found that pyconfig.h is included twice:
(31374007, 6521, 21529, 1, 'x', 'include\\pyconfig.h'),
(31380528, 6521, 21529, 1, 'x', 'Include\\pyconfig.h'),
(31387049, 984, 2102, 1, 'x', 'pytz\\zoneinfo\\CET'),
Sadly though, I don't know how to fix it.
PyInstaller Manual link:
http://www.pyinstaller.org/export/d3398dd79b68901ae1edd761f3fe0f4ff19cfb1a/project/doc/Manual.html#archiveviewer