How to extract details from an image using coordinates - python-3.7

import time
import cv2
import pytesseract
import numpy as np
import pdf2image
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
x_axis = 2400
y_axis = 2700
pdf = pdf2image.convert_from_path(pdf_path='E:\\Rebecca\\VR115485 - 82520940 - NUCOR STEEL TUSCALOOSA - 400131.pdf',poppler_path='E:\\Ajai Krishna\\propeler\\poppler-0.68.0\\bin')
for _n in range(0, len(pdf)):
    try:
        img = pdf[_n].resize((x_axis, y_axis))
        bag_of_words = []
        clusters_coordinates = []
        img_np = np.zeros([100, 100])
        img_graph = cv2.resize(img_np, (x_axis, y_axis))
        img = np.asarray(img)
        # #img = cv2.medianBlur(img, 5)
        text = str(pytesseract.image_to_string(img))
        # pdf2image returns PIL images, which are in RGB channel order
        img_gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        # image_to_data returns one row per detected word, including its bounding box
        box = pytesseract.image_to_data(img_gray)
        print(box)
        label = "check no"
        for index, b in enumerate(box.splitlines()):
            if index != 0:  # skip the header row
                b = b.split()
                if len(b) == 12:  # only rows that contain recognized text
                    x, y, w, h = int(b[6]), int(b[7]), int(b[8]), int(b[9])
    except Exception as exc:
        # Report the failure instead of silently skipping the page
        print(exc)
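Printing box gives the following output (columns 7 to 10 are left, top, width and height; the last two are confidence and text):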
5 1 2 1 2 6 1851 153 141 45 96.836044 Check
5 1 2 1 2 7 2018 156 56 44 96.992538 No
5 1 2 1 3 7 1852 220 167 43 92.319000 400131
5 1 2 1 2 2 301 155 112 43 38.483887 Name
5 1 2 1 3 3 300 220 141 43 57.061188 NOCOR
5 1 2 1 3 4 472 211 141 51 11.992462 STERI
5 1 2 1 3 5 640 183 307 135 58.077271 TUSCALOOSA
5 1 2 1 4 2 80 348 210 46 49.844437 VOUCHER
5 1 2 1 5 1 33 474 235 61 23.609283 ‘vRi15404
5 1 2 1 6 1 33 528 239 55 15.245552 “VRLI5485
5 1 3 1 1 1 51 605 222 42 38.442249 VR195486
5 1 2 1 4 3 293 315 263 78 4.121895 REFFRENCK
5 1 2 1 5 2 304 435 222 95 62.667671 82520840
5 1 2 1 6 2 304 540 222 43 89.974838 82520940
5 1 3 1 1 2 303 599 223 48 91.218178 82521040
Required output:
Check No: 400131 ;;
VOUCHER : ‘vRi15404, “VRLI5485, VR195486 ;;
REFFRENCK: 82520840, 82520940, 82521040
Is there a way to extract these particular details based on the coordinates of the words, using Python and Tesseract?
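One possible approach, sketched here under assumptions rather than offered as a definitive answer: treat each label word as a column header and collect the words that sit below it at roughly the same left coordinate. The helper names parse_words and values_below, and the tolerance x_tol=50, are hypothetical choices made for this illustration; the sample rows are copied from the output above.

```python
# Rows copied from the image_to_data output shown above (header line added).
sample_output = """level page_num block_num par_num line_num word_num left top width height conf text
5 1 2 1 2 6 1851 153 141 45 96.836044 Check
5 1 2 1 2 7 2018 156 56 44 96.992538 No
5 1 2 1 3 7 1852 220 167 43 92.319000 400131
5 1 2 1 4 2 80 348 210 46 49.844437 VOUCHER
5 1 2 1 5 1 33 474 235 61 23.609283 ‘vRi15404
5 1 2 1 6 1 33 528 239 55 15.245552 “VRLI5485
5 1 3 1 1 1 51 605 222 42 38.442249 VR195486
5 1 2 1 4 3 293 315 263 78 4.121895 REFFRENCK
5 1 2 1 5 2 304 435 222 95 62.667671 82520840
5 1 2 1 6 2 304 540 222 43 89.974838 82520940
5 1 3 1 1 2 303 599 223 48 91.218178 82521040"""


def parse_words(tsv):
    """Turn the text returned by image_to_data into a list of word dicts."""
    words = []
    for line in tsv.splitlines():
        fields = line.split()
        # Word rows have 12 fields; this also skips the header row.
        if len(fields) != 12 or not fields[6].isdigit():
            continue
        words.append({
            "left": int(fields[6]),
            "top": int(fields[7]),
            "text": fields[11],
        })
    return words


def values_below(words, label, x_tol=50):
    """Return the words under `label` (similar left edge), top to bottom.

    x_tol is an assumed pixel tolerance; tune it for your page layout.
    """
    header = next(w for w in words if w["text"] == label)
    column = [
        w for w in words
        if w["top"] > header["top"]
        and abs(w["left"] - header["left"]) <= x_tol
    ]
    return [w["text"] for w in sorted(column, key=lambda w: w["top"])]


words = parse_words(sample_output)
for label in ("Check", "VOUCHER", "REFFRENCK"):
    print(label, ":", ", ".join(values_below(words, label)))
```

With real pages the tolerance usually needs adjusting, and a lower bound on the confidence column could be added to drop misreads.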


Replace values of one pyspark dataframe with another

I have a pyspark dataframe df2:
ID   Total_Count   Final_A   Final_B   Final_C   Final_D
11   80            36        30        8         6
4    80            36        30        8         6
13   65            30        24        6         5
12   56            26        21        5         4
2    65            30        24        6         5
1    56            26        21        5         4
I have another dataframe df1:
ID   Total_Count   A    B    C    D
4    80            0    0    3    0
11   80            0    0    0    0
13   65            0    0    0    0
12   56            0    4    0    0
2    65            0    0    0    0
1    56            0    0    0    0
10   34            10   10   10   4
I want to replace the values in df1 with the values from df2 for each matching ID (the primary key).
Expected df1:
ID   Total_Count   A    B    C    D
11   80            36   30   8    6
4    80            36   30   8    6
13   65            30   24   6    5
12   56            26   21   5    4
2    65            30   24   6    5
1    56            26   21   5    4
10   34            10   10   10   4
from pyspark.sql.functions import when

df1 = spark.read.option("header", "True").option("inferSchema", "True").csv("df1.csv")
df2 = spark.read.option("header", "True").option("inferSchema", "True").csv("df2.csv")
# Rename the join keys in df2 so they do not clash with df1's columns
df2 = df2.withColumnRenamed("ID", 'df2_ID').withColumnRenamed("Total_Count", 'df2_Total_Count')
final_df = df1.join(df2, (df1.ID == df2.df2_ID) & (df1.Total_Count == df2.df2_Total_Count), "left")
# Take the Final_* value only where df1 holds 0; otherwise keep the df1 value
for i in ('A', 'B', 'C', 'D'):
    final_df = final_df.withColumn(i, when(final_df[i] == 0, final_df["Final_{}".format(i)]).otherwise(final_df[i]))
# Drop the helper columns that came from df2
cols = df2.columns
final_df = final_df.drop(*cols)
from pyspark.sql.functions import coalesce

# Alternative: left join on ID, then prefer the Final_* value whenever df2 has a matching row
df = df1.join(df2.select('ID', 'Final_A', 'Final_B', 'Final_C', 'Final_D'), ['ID'], 'left')
df = df.withColumn('A', coalesce(df['Final_A'], df['A'])).\
    withColumn('B', coalesce(df['Final_B'], df['B'])).\
    withColumn('C', coalesce(df['Final_C'], df['C'])).\
    withColumn('D', coalesce(df['Final_D'], df['D']))
df1 = df.select('ID', 'Total_Count', 'A', 'B', 'C', 'D')
df1.show()
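To check the expected result without a Spark session, the left-join-then-coalesce logic can be mirrored in plain Python. This is only an illustrative sketch: replace_from is a hypothetical helper name, and the rows are the ones listed in the question.

```python
# Each row: (ID, Total_Count, A, B, C, D), copied from df1 in the question.
df1_rows = [
    (4, 80, 0, 0, 3, 0),
    (11, 80, 0, 0, 0, 0),
    (13, 65, 0, 0, 0, 0),
    (12, 56, 0, 4, 0, 0),
    (2, 65, 0, 0, 0, 0),
    (1, 56, 0, 0, 0, 0),
    (10, 34, 10, 10, 10, 4),
]
# Each row: (ID, Total_Count, Final_A, Final_B, Final_C, Final_D), from df2.
df2_rows = [
    (11, 80, 36, 30, 8, 6),
    (4, 80, 36, 30, 8, 6),
    (13, 65, 30, 24, 6, 5),
    (12, 56, 26, 21, 5, 4),
    (2, 65, 30, 24, 6, 5),
    (1, 56, 26, 21, 5, 4),
]


def replace_from(df1_rows, df2_rows):
    """Left join on ID; use the Final_* values when a match exists."""
    finals = {row[0]: row[2:] for row in df2_rows}
    result = []
    for row in df1_rows:
        row_id, total = row[0], row[1]
        # Same idea as coalesce(Final_X, X): fall back to df1's own values.
        values = finals.get(row_id, row[2:])
        result.append((row_id, total) + tuple(values))
    return result


for row in replace_from(df1_rows, df2_rows):
    print(row)
```

ID 10 has no counterpart in df2, so its row passes through unchanged, which matches the expected df1.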

Pyspark dataframe conditional filter and imputation

I have a pyspark dataframe df:
ID   Total_Count   A    B    C    D    Group   Name         Chain
1    56            0    0    0    0    1       Apple        Fruits1
2    65            0    0    0    0    1       Apple        Fruits1
3    72            0    0    30   0    1       Banana       Fruits1
4    80            0    0    0    0    1       Strawberry   Fruits1
5    142           58   58   14   12   1       Apple        Fruits1
6    130           63   50   9    8    1       Apple        Fruits1
7    145           74   44   17   10   1       Apple        Fruits1
8    119           54   48   8    9    1       Apple        Fruits1
11   161           71   63   16   11   1       Banana       Fruits1
12   124           54   43   19   8    1       Banana       Fruits1
I want to impute the A, B, C, D columns in every row where any of them is 0 (IDs 1, 2, 3, 4).
1.) Logic: use the average over Group x Name (if available), otherwise the average over Group x Chain (if available), otherwise the average over Group.
Taking IDs 1 and 2 as an example:
After filtering for Group 1 and Name Apple (the rows that share the Group and Name of IDs 1 and 2), the proportions are calculated as A/Total_Count, B/Total_Count and so on:
A_PROP     B_PROP        C_PROP     D_PROP
0.408451   0.408450704   0.098592   0.084507042
0.484615   0.384615385   0.069231   0.061538462
0.510345   0.303448276   0.117241   0.068965517
0.453782   0.403361345   0.067227   0.075630252
2.) The average of the above 4 rows is then taken (for IDs 1 and 2 in this example).
A, B, C, D in df2 are calculated as X_prop_avg * Total_Count.
Expected output (df2):
ID   Total_Count   A_prop_avg    B_prop_avg    C_prop_avg    D_prop_avg    A    B    C   D
1    56            0.46429811    0.37496893    0.08807265    0.07266032    26   21   5   4
2    65            0.464298107   0.374968927   0.088072647   0.072660318   30   24   6   5
3    72            0.43823883    0.369039271   0.126302344   0.066419555   32   27   9   5
4    80            0.455611681   0.372992375   0.10081588    0.070580064   36   30   8   6
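The rule described above can be sketched in plain Python to make the fallback order concrete. This is an illustration under assumptions, not a PySpark solution: the helper names are hypothetical, donor rows are taken to be those with no 0 in A to D, and the result is rounded to the nearest integer, which is what the expected output appears to do.

```python
# (ID, Total_Count, A, B, C, D, Group, Name, Chain), copied from df above.
rows = [
    (1, 56, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (2, 65, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (3, 72, 0, 0, 30, 0, 1, "Banana", "Fruits1"),
    (4, 80, 0, 0, 0, 0, 1, "Strawberry", "Fruits1"),
    (5, 142, 58, 58, 14, 12, 1, "Apple", "Fruits1"),
    (6, 130, 63, 50, 9, 8, 1, "Apple", "Fruits1"),
    (7, 145, 74, 44, 17, 10, 1, "Apple", "Fruits1"),
    (8, 119, 54, 48, 8, 9, 1, "Apple", "Fruits1"),
    (11, 161, 71, 63, 16, 11, 1, "Banana", "Fruits1"),
    (12, 124, 54, 43, 19, 8, 1, "Banana", "Fruits1"),
]


def needs_imputation(row):
    """A row is imputed when any of A, B, C, D is 0."""
    return 0 in row[2:6]


def average_proportions(donors):
    """Mean of A/Total_Count, ..., D/Total_Count over the donor rows."""
    n = len(donors)
    return [sum(d[2 + k] / d[1] for d in donors) / n for k in range(4)]


def impute(row, rows):
    """Return imputed [A, B, C, D] for `row` using the fallback order."""
    total, group, name, chain = row[1], row[6], row[7], row[8]
    donors = [r for r in rows
              if r[6] == group and not needs_imputation(r)]
    # Fallback order: Group x Name, then Group x Chain, then Group alone.
    filters = (
        lambda r: r[7] == name,
        lambda r: r[8] == chain,
        lambda r: True,
    )
    for keep in filters:
        pool = [r for r in donors if keep(r)]
        if pool:
            return [round(p * total) for p in average_proportions(pool)]
    return list(row[2:6])  # no donors at all: leave the row unchanged


for row in rows:
    if needs_imputation(row):
        print(row[0], row[1], impute(row, rows))
```

In PySpark the same idea would typically be expressed with grouped averages joined back onto the rows that need imputation; the sketch only pins down the arithmetic.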

How can I select rows with specific column values from a matrix?

I have a matrix train3.
1 2 3 4 5 6 7
2 12 13 14 15 16 17
3 62 53 44 35 26 17
4 52 13 24 15 26 37
I want to select only those rows whose 1st column contains specific values (in my case 1 and 2).
I have tried the following,
>> train3
train3 =
1 2 3 4 5 6 7
2 12 13 14 15 16 17
3 62 53 44 35 26 17
4 52 13 24 15 26 37
>> ind1 = train3(:,1) == 1
ind1 =
1
0
0
0
>> ind2 = train3(:,1) == 2
ind2 =
0
1
0
0
>> mat1 = train3(ind1, :)
mat1 =
1 2 3 4 5 6 7
>> mat2 = train3(ind2, :)
mat2 =
2 12 13 14 15 16 17
>> mat3 = [mat1 ; mat2]
mat3 =
1 2 3 4 5 6 7
2 12 13 14 15 16 17
>>
Is there any better way to do this?
Presumably you are trying to get mat3 in a single step which you can do with:
mat3 = train3(train3(:,1)==1 | train3(:,1)==2,:)
A more general way to do this would be to use ismember to get all of the rows that match the values in a list:
train3 =[
1 2 3 4 5 6 7
2 12 13 14 15 16 17
3 62 53 44 35 26 17
4 52 13 24 15 26 37];
chooseList = [1 2];
colIndex = ismember(train3(:, 1), chooseList);
subset = train3(colIndex, :);
subset =
1 2 3 4 5 6 7
2 12 13 14 15 16 17
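For readers who want to compare with Python, the same membership-based row selection can be sketched with a list comprehension. This is only an analogue of the ismember approach above, using the same example values; the variable names mirror the MATLAB ones.

```python
train3 = [
    [1, 2, 3, 4, 5, 6, 7],
    [2, 12, 13, 14, 15, 16, 17],
    [3, 62, 53, 44, 35, 26, 17],
    [4, 52, 13, 24, 15, 26, 37],
]
choose_list = {1, 2}  # values allowed in the first column

# Keep every row whose first element is in choose_list.
subset = [row for row in train3 if row[0] in choose_list]

for row in subset:
    print(row)
```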

Matlab: how can I transform this algorithm involving matrix manipulation?

(In my actual problem, A is a 4x500000 matrix and the values of A(4,k) vary between 1 and 200.)
Here is an example where A is 4x16 and A(4,k) varies between 1 and 10.
First, I want to assign a name to each value from 1 to 5 (= 10/2):
1 = XXY;
2 = ABC;
3 = EFG;
4 = TXG;
5 = ZPF;
My goal is to find, for a given vector X, a matrix M derived from the matrix A:
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10]
A(4,k) takes all values between 1 and 10. These values can be repeated and they all appear on the 4th line.
X = [20; 32; 40] = A(1:3,1) = A(1:3,6) = A(1:3,8) = A(1:3,9) = A(1:3,12) = A(1:3,16)
A(4,1) = 2;
A(4,6) = 5;
A(4,8) = 1;
A(4,9) = 3;
A(4,12) = 6;
A(4,16) = 10;
For each A(4,k) whose column matches X, I associate 2 if A(4,k) <= 5 and 1 if A(4,k) > 5. For the remaining values of A(4,k), whose columns do not match X, I associate 0:
ZX = [ 1 2 3 4 5     %% value of the fourth line of A between 1 and 5
       2 2 2 0 2
       6 7 8 9 10    %% value of the fourth line of A between 6 and 10
       1 0 0 0 1
       2 2 2 0 2 ]   %% = max(ZX(2,k),ZX(4,k))
The ultimate goal is to find the matrix M:
M = [ 1   2   3   4   5
      XXY ABC EFG TXG ZPF
      2   2   2   0   2 ]   %% M(3,:)=ZX(5,:)
Code -
%// Assuming A, X and names to be given to the solution
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10];
X = [20 ; 32 ; 40];
names = {'XXY','ABC','EFG','TXG','ZPF'};
limit = 10; %// The maximum limit of A(4,:). Edit this to 200 for your actual case
%// Find matching 4th row elements
matches = A(4,ismember(A(1:3,:)',X','rows'));
%// Matches are compared against all possible numbers between 1 and limit
matches_pos = ismember(1:limit,matches);
%// Finally get the line 3 results of M
vals = max(2*matches_pos(1:limit/2),matches_pos( (limit/2)+1:end ));
Output -
vals =
2 2 2 0 2
For a better way to present the results, you can use a struct -
M_struct = cell2struct(num2cell(vals),names,2)
Output -
M_struct =
XXY: 2
ABC: 2
EFG: 2
TXG: 0
ZPF: 2
For writing the results to a text file -
output_file = 'results.txt'; %// Edit if needed to be saved to a different path
fid = fopen(output_file, 'w+');
for ii=1:numel(names)
fprintf(fid, '%d %s %d\n',ii, names{ii},vals(ii));
end
fclose(fid);
Text contents of the text file would be -
1 XXY 2
2 ABC 2
3 EFG 2
4 TXG 0
5 ZPF 2
A bsxfun() based approach.
Suppose your inputs are (where N can be set to 200):
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10]
X = [20; 32; 40]
N = 10;
% Match first 3 rows and return 4th
idxA = all(bsxfun(@eq, X, A(1:3,:)));
Amatch = A(4,idxA);
% Match [1:5; 6:10] to 4th row
idxZX = ismember([1:N/2; N/2+1:N], Amatch)
idxZX =
1 1 1 0 1
1 0 0 0 1
% Return M3
M3 = max(bsxfun(@times, idxZX, [2;1]))
M3 =
2 2 2 0 2
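As a cross-check of the logic in both answers, here is a small plain-Python sketch of the same steps: find the columns of A whose first three rows equal X, collect their 4th-row values, then score each value from 1 to N/2. The function name third_row_of_M is hypothetical; the data is the example from the question.

```python
A = [
    [20, 52, 70, 20, 52, 20, 52, 20, 20, 10, 52, 20, 11, 1, 52, 20],
    [32, 24, 91, 44, 60, 32, 24, 32, 32, 12, 11, 32, 2, 5, 24, 32],
    [40, 37, 24, 30, 11, 40, 37, 40, 40, 5, 10, 40, 40, 3, 37, 40],
    [2, 4, 1, 3, 4, 5, 2, 1, 3, 3, 8, 6, 7, 9, 6, 10],
]
X = [20, 32, 40]
names = ["XXY", "ABC", "EFG", "TXG", "ZPF"]
N = 10  # maximum value in the 4th row (200 in the real problem)


def third_row_of_M(A, X, N):
    """Score k = 1..N/2: 2 if k matched, else 1 if k + N/2 matched, else 0."""
    # 4th-row values of the columns whose first three rows equal X.
    matches = {col[3] for col in zip(*A) if list(col[:3]) == X}
    half = N // 2
    return [
        max(2 * (k in matches), 1 * (k + half in matches))
        for k in range(1, half + 1)
    ]


vals = third_row_of_M(A, X, N)
M = dict(zip(names, vals))
print(vals)
print(M)
```

For the example data this reproduces the row 2 2 2 0 2 shown in the answers above.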

How to calculate intensity inhomogeneity based on an average filter in MATLAB

I have a question about intensity inhomogeneity. I read a paper that defines a way to calculate the intensity inhomogeneity based on an average filter.
Here is my problem: I have an image I (see the code below) and an average filter with r=3. I want to calculate the transformed image J based on formula (17). Could you help me implement it in MATLAB? Thank you so much.
This is my code
%Create image I
I=[3 5 5 2 0 0 6 13 1
0 3 7 5 0 0 2 8 6
4 5 5 4 2 1 3 5 9
17 10 3 1 3 7 9 9 0
7 25 0 0 5 0 10 13 2
111 105 25 19 13 11 11 8 0
103 105 15 26 0 12 2 6 0
234 238 144 140 51 44 7 8 8
231 227 150 146 43 50 8 16 9
];
%% Create filter AF
size=3; % scale parameter in Average kernel
AF=fspecial('average',[size,size]); % Average kernel
%%How to calculate CN and J
CN=mean(I(:));%Correct?
J=???
You're pretty close! The mean intensity is calculated correctly; all that is missing to calculate J is to apply the filter defined with fspecial to your image:
Here is the code:
clc
clear
%Create image I
I=[3 5 5 2 0 0 6 13 1
0 3 7 5 0 0 2 8 6
4 5 5 4 2 1 3 5 9
17 10 3 1 3 7 9 9 0
7 25 0 0 5 0 10 13 2
111 105 25 19 13 11 11 8 0
103 105 15 26 0 12 2 6 0
234 238 144 140 51 44 7 8 8
231 227 150 146 43 50 8 16 9
];
% Create filter AF
size=3; % scale parameter in Average kernel
AF=fspecial('average',[size,size]); % Average kernel
%%How to calculate CN and J
CN=mean(I(:)); % This is correct
J = (CN*I)./imfilter(I,AF); % Apply the filter to the image
figure;
subplot(1,2,1)
image(I)
subplot(1,2,2)
image(J)
Resulting in the following: a figure with the original image I in the left panel and the corrected image J in the right panel.