Using .traineddata with passportEye Python for MRZ - tesseract

I am trying to improve the accuracy of passport MRZ reading with Tesseract OCR and PassportEye. I have found a few GitHub repositories containing "*.traineddata" files; they say to move the file into the Tesseract tessdata folder, which I did. Nowhere in the READMEs of these repos does it say how to use the file. I believe it is something trivial, but I am very new to Tesseract.
How do I use it with PassportEye in Python? I am completely lost here and have searched a lot. Here is the current code.
import os
from passporteye import read_mrz
pr_path = os.getcwd()
file_path = os.path.join(pr_path,'my_app', 'data')
mrz = read_mrz(file_path + '/test1.jpg')
print(mrz)
This is the .traineddata file I want to test for more accuracy : https://github.com/DoubangoTelecom/tesseractMRZ/blob/master/tessdata_best/mrz.traineddata
I do not want to use bulky openCV. Please help

From looking at the source code, I would say you can't without changing the PassportEye codebase.
Normally you would pass the language to tesseract via the -l parameter, in your case:
-l mrz
But the PassportEye implementation does not give you that option:
https://github.com/konstantint/PassportEye/blob/929c186c4dfa80a1ac975b5f2b95002ca12889d0/passporteye/util/ocr.py#L48
It passes lang=None; you would need to change that part to lang='mrz':
pytesseract.run_tesseract(input_file_name,
                          output_file_name_base,
                          'txt',
                          lang='mrz',
                          config=config)
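To check that the mrz model is picked up at all, independently of PassportEye, one option is to call the tesseract command line directly with -l mrz. The sketch below is only an illustration: the helper names build_tesseract_cmd and ocr_with_model are made up for this example, and it assumes the tesseract binary is on PATH and mrz.traineddata sits in the tessdata folder (or in the folder passed as tessdata_dir). It runs OCR on the whole image, not on the MRZ region that PassportEye extracts.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="mrz", tessdata_dir=None):
    """Build the tesseract command line (hypothetical helper for this example)."""
    # "stdout" as the output base makes tesseract print the recognised text.
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir:
        # Only needed when the .traineddata file is not in the default folder.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def ocr_with_model(image_path, lang="mrz", tessdata_dir=None):
    """Run tesseract with the given language model and return its text output."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, tessdata_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

If ocr_with_model('test1.jpg') fails with an error about loading the language, the .traineddata file is probably not in the folder tesseract is looking in.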


How does one read multiple DICOM and PNG files at once using pydicom.read_file() and cv2.imread()?

I am currently working on a fully convolutional neural network for renal segmentation in MR images. I have 40 images and their ground truth labels, and I am attempting to load all of the images for pre-processing.
I am using Google Colab for this project, with the latest versions of pydicom and pip installed. My Google Drive is mounted in Colab, and the code below shows the correct paths to the images and their masks in the pydicom.read_file() and cv2.imread() calls, respectively.
However, when I use the "/../IMG*.dcm" or "/../IMG*.png" file paths (which should be legal?), I receive a "FileNotFoundError" as listed below. But, when I specify a specific .dcm or .png image, the pydicom.read_file() and cv2.imread() calls function quite normally.
Any suggestions on how to resolve this issue? I am struggling a lot with loading the data and pre-processing but have the model architecture ready to go once these preliminary hurdles are overcome.
#import data as data
import pydicom
import numpy as np
images= pydicom.read_file("/content/drive/My Drive/CHOAS_Kidney_Labels/Training_Images/T1DUAL/IMG*.dcm");
numpyArray = images.pixel_array
masks= cv2.imread("/content/drive/My Drive/CHOAS_Kidney_Labels/Ground_Truth_Training/T1DUAL/IMG*.png");
-----> FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/CHOAS_Kidney_Labels/Training_Images/T1DUAL/IMG*.dcm'
pydicom.read_file does not support wildcards. You have to iterate over the files yourself, something like (untested):
import glob
import pydicom
pixel_data = []
paths = glob.glob("/content/drive/My Drive/CHOAS_Kidney_Labels/Training_Images/T1DUAL/IMG*.dcm")
for path in paths:
    dataset = pydicom.dcmread(path)
    pixel_data.append(dataset.pixel_array)
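The same loop works for the PNG masks, since cv2.imread does not expand wildcards either. A minimal sketch of a shared helper follows; load_all is a hypothetical name, not part of pydicom or OpenCV. It sorts the matched paths so that images and masks line up by file name, which assumes the two folders use matching names.

```python
import glob


def load_all(pattern, reader):
    """Expand a wildcard pattern and apply `reader` to every match, in sorted order."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        # Fail loudly instead of silently returning an empty list.
        raise FileNotFoundError(f"no files match {pattern!r}")
    return [reader(path) for path in paths]


# Usage with the paths from the question (requires pydicom and cv2):
# images = load_all(".../Training_Images/T1DUAL/IMG*.dcm",
#                   lambda p: pydicom.dcmread(p).pixel_array)
# masks = load_all(".../Ground_Truth_Training/T1DUAL/IMG*.png", cv2.imread)
```

Passing the reader as an argument keeps the wildcard handling in one place, so the DICOM and PNG cases differ only in the function used to open each file.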

How to download a file (csv.gz) from a url using Python 3.7

As with others who have posted in the past, I cannot figure out how to download a csv.gz file from a URL in Python 3.7. I see posts but they only post a 2kb file.
I am a 100% newbie using Python. What follows is the code for one file that I am trying to obtain. I can't even do that. The final goal would be to request all files that start with 2019* using python. Please try the code below to save the file. As others stated, the file is just a name without the true content - Ref: Downloading a csv.gz file from url in Python
import requests
url = 'https://public.bitmex.com/?prefix=data/trade/20191026.csv.gz'
r = requests.get(url, allow_redirects=True)
open('20191026.csv.gz', 'wb').write(r.content)
Yields:
Out[40]:
1245
I've tried "wget" and urllib.request along with "urlretrieve" also.
I wish I could add a screenshot or attach a file. The file created is 2kb and not even a csv.gz file. But the true file that I can download from a web browser is 78mb. The file is 20191026.csv.gz not that it matters as they all do the same thing. The location is https://public.bitmex.com/?prefix=data/trade/
Again, if you know of a way to obtain all the files matching a filter such as 2019*csv.gz, that would be fantastic.
You are trying to download the files listed at https://public.bitmex.com/?prefix=data/trade/.
To achieve your final goal of downloading all the files starting with 2019, you have to proceed in 3 steps:
1) Read the content of https://public.bitmex.com/?prefix=data/trade/
2) Convert the content into a list and keep only the file names that start with 2019.
3) Download each file in the resulting list, using the example you are referring to.
Hope this approach helps.
Happy coding.
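Steps 2 and 3 could be sketched roughly as follows, using only the standard library. The function names select_files and download are made up for this example, and the sketch assumes you already have the list of file names from step 1 and know the direct URL of each file; neither is shown here. Note that the ?prefix=... address from the question returns the listing page and not the file itself, which would explain the 2kb result.

```python
import fnmatch
import shutil
import urllib.request


def select_files(names, pattern="2019*.csv.gz"):
    """Step 2: keep only the names that match the wildcard pattern."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))


def download(url, dest):
    """Step 3: stream one file to disk without loading it all into memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Usage (file_url_for is a placeholder for however you build the direct URL):
# for name in select_files(all_names):
#     download(file_url_for(name), name)
```

Streaming with shutil.copyfileobj matters here because the real files are tens of megabytes each.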

Importing library in Dart on Windows

I've been trying to make a library in Dart and import it in my project. Though for some reason it won't do it.
Here's how it looks:
It says it can't find the library, though the path is correct. I also tried a bunch of other paths:
SmartCanvas.dart
SmartCanvas/SmartCanvas.dart
SmartCanvas
SmartCanvas/SmartCanvas
./SmartCanvas/SmartCanvas.dart
../SmartCanvas/SmartCanvas.dart
./SmartCanvas.dart
../SmartCanvas.dart
./SmartCanvas
../SmartCanvas
Note: The project I'm trying to import this library into is located somewhere totally different on my hard drive (my Dropbox folder).
Does anyone know what I should use as the path, or how I can import the library properly?
Thanks!
#import expects a full path or correct relative path to a .dart file that has the #library line.
Here is an example from working code:
https://github.com/johnmccutchan/DartVectorMath/blob/master/test/console_test_harness.dart
At the top you see #import('../lib/vector_math_console.dart');
which is located:
https://github.com/johnmccutchan/DartVectorMath/blob/master/lib/vector_math_console.dart
Chopping off the github url prefix, we are left with:
test/console_test_harness.dart
lib/vector_math_console.dart
The import line uses the correct relative path from test/ into ../lib/ to find vector_math_console.dart (the library).
HTH,
John
Try this for windows
#import('/c:/users/pablo/pablo\'s documents/projects/smartcanvas/smartcanvas.dart');
To import local libraries in Dart, I'd recommend using a path dependency in pubspec.yaml. This is a much cleaner approach than embedding absolute paths in the Dart code.
Read about it here: https://www.dartlang.org/tools/pub/dependencies.html#path-packages

Weka EM cluster get "Error: Could not find or load main class test" in eclipse

I want to use Weka to cluster tweets from a database in JSP. In the GUI, I find only HierarchicalClusterer and FilteredClusterer available for string clustering. Then I found this ClusteringDemo sample code on the official Weka website: https://svn.scms.waikato.ac.nz/svn/weka/trunk/wekaexamples/src/main/java/wekaexamples/clusterers/ClusteringDemo.java
However, after setting up the sample ARFF file in the Weka directory, I get this error: "Error: Could not find or load main class ClusteringDemo".
Can anyone help me find the reason?
I only changed the filename in the statement data = DataSource.read(filename);. Besides, my classpath is set up correctly, since I have already run some classifiers.
1.- Maybe ClusteringDemo.class is not in your classpath.
You should add the class or JAR file to your project.
2.- Anyway, you can download the Java code from: http://weka.wikispaces.com/file/detail/ClusteringDemo.java
Compile and run it (make sure that weka.jar is in your classpath).
3.- If you have added ClusteringDemo.java to your project, make sure that its "package" line (the first line) matches its location. Otherwise Java will not be able to find it.
Good luck using EM, maybe you can also try N-grams + Naive Bayes.

What is the best file parsing solution for converting files?

I am looking for the best solution for custom file parsing for our enterprise import routines. I basically want to convert each incoming file format into one standard file format and have a single routine that imports that data into the database. I need to be able to create custom scripts for each client, since it's difficult to get customers to comply with a standard or template format. I have looked at PowerShell and IronPython so far, but I am not sure this is the route I want to go. I have also looked at tools such as Talend, a drag-and-drop style tool that may or may not give me the flexibility I want. We are a .NET shop and have written custom code to do this in the past, but I need something that is quicker to create than coding custom parsing functions each time we get a new file format.
Depending on the complexity and variability of your work, you should consider an ETL tool like SSIS (SQL Server Integration Services).
Python is wonderful for this kind of thing. That's what we use. Each new customer transfer is a new adventure, and Python gives us the flexibility to respond quickly.
Edit. All python scripts that read files are "custom file parsers". Without an actual example, it's not sensible to provide a detailed example.
with open( "some file", "r" ) as source:
    for line in source:
        process( line )
That's about all there is to a "custom file parser". If you're parsing .csv or .xml files, then Python has modules for that. If you're parsing fixed-format files, you'd use string slicing operations. If you're parsing other files (X12? JSON? YAML?) you'll need appropriate parsers.
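Tab-Delim.
from collections import namedtuple
# Extend the list with further field names as the layout requires.
RecordLayout = namedtuple('RecordLayout', ['field1', 'field2', 'field3'])
def process( aLine ):
    # _make builds the record from an iterable of field values
    record = RecordLayout._make( aLine.rstrip('\n').split('\t') )
    ...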
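Fixed Layout.
from collections import namedtuple
# Extend the list with further field names as the layout requires.
RecordLayout = namedtuple('RecordLayout', ['field1', 'field2', 'field3'])
def process( aLine ):
    # One slice per field; add further slices to match the field list
    fields = ( aLine[:10], aLine[10:20], aLine[20:30] )
    record = RecordLayout._make( fields )
    ...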
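As a small runnable illustration of the two layouts above, here is a sketch with a made-up three-field record. The field names, the column widths and the function names parse_tab and parse_fixed are all assumptions for the example; a real client format would supply its own.

```python
from collections import namedtuple

# Hypothetical record layout used only for this example.
Record = namedtuple("Record", ["name", "city", "amount"])


def parse_tab(line):
    """Parse one tab-delimited line into a Record."""
    return Record._make(line.rstrip("\n").split("\t"))


def parse_fixed(line, widths=(10, 10, 10)):
    """Parse one fixed-width line; `widths` gives each column's width in characters."""
    fields = []
    start = 0
    for width in widths:
        # Slice out the column and drop the padding around the value.
        fields.append(line[start:start + width].strip())
        start += width
    return Record._make(fields)
```

Both functions return the same Record type, so a single import routine can consume the result whichever client format the file arrived in.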