Pivot table and one-hot encoding in PySpark

I have a PySpark data frame which looks like this:
id age cost gender
1 38 230 M
2 40 832 M
3 53 987 F
1 38 764 M
4 63 872 F
5 21 763 F
I want my data frame to look like this:
id age cost gender M F
1 38 230 M 1 0
2 40 832 M 1 0
3 53 987 F 0 1
1 38 764 M 1 0
4 63 872 F 0 1
5 21 763 F 0 1
Using pandas I can manage it in the following way:
final_df = pd.concat([df.drop(['gender'], axis=1), pd.get_dummies(df['gender'])], axis=1)
How can I manage this in PySpark?

You just need to add 2 columns:
from pyspark.sql import functions as F

final_df = df.select(
    "id",
    "age",
    "cost",
    "gender",
    F.when(F.col("gender") == F.lit("M"), 1).otherwise(0).alias("M"),
    F.when(F.col("gender") == F.lit("F"), 1).otherwise(0).alias("F"),
)
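
If the categories are not known in advance, a more generic sketch (assuming the distinct gender values are few and usable as column names) collects them first and builds one indicator column per value; for an aggregated pivot table rather than per-row indicators, groupBy(...).pivot("gender") would be the usual route:

from pyspark.sql import functions as F

# collect the distinct categories to drive the column list (assumes a small set)
categories = [row[0] for row in df.select("gender").distinct().collect()]

final_df = df.select(
    "*",
    *[F.when(F.col("gender") == c, 1).otherwise(0).alias(c) for c in categories],
)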

Related

Replace values of one pyspark dataframe with another

I have a PySpark dataframe df2:
ID  Total_Count  Final_A  Final_B  Final_C  Final_D
11  80           36       30       8        6
4   80           36       30       8        6
13  65           30       24       6        5
12  56           26       21       5        4
2   65           30       24       6        5
1   56           26       21       5        4
I have another dataframe df1:
ID  Total_Count  A   B   C   D
4   80           0   0   3   0
11  80           0   0   0   0
13  65           0   0   0   0
12  56           0   4   0   0
2   65           0   0   0   0
1   56           0   0   0   0
10  34           10  10  10  4
I want to replace the values of df1 with those of df2 for each matching ID (primary key).
Expected df1:
ID  Total_Count  A   B   C   D
11  80           36  30  8   6
4   80           36  30  8   6
13  65           30  24  6   5
12  56           26  21  5   4
2   65           30  24  6   5
1   56           26  21  5   4
10  34           10  10  10  4
df2 = spark.read.option("header", "True").option("inferSchema", "True").csv("df1.csv")
df1 = spark.read.option("header", "True").option("inferSchema", "True").csv("df2.csv")

df2 = df2.withColumnRenamed("ID", "df2_ID").withColumnRenamed("Total_Count", "df2_Total_Count")
final_df = df1.join(df2, (df1.ID == df2.df2_ID) & (df1.Total_Count == df2.df2_Total_Count), "left")

from pyspark.sql.functions import when

for i in ('A', 'B', 'C', 'D'):
    # take the df2 value whenever the join found a match; otherwise keep the df1 value
    final_df = final_df.withColumn(
        i,
        when(final_df["df2_ID"].isNotNull(), final_df["Final_{}".format(i)]).otherwise(final_df[i]),
    )

cols = df2.columns
final_df = final_df.drop(*cols)
Alternatively, with coalesce (assuming df2 still has its original ID column; the left join leaves the Final_* columns null for IDs with no match, so coalesce falls back to the original value):

from pyspark.sql.functions import coalesce

df = df1.join(df2.select('ID', 'Final_A', 'Final_B', 'Final_C', 'Final_D'), 'ID', 'left')
df = df.withColumn('A', coalesce(df['Final_A'], df['A'])).\
    withColumn('B', coalesce(df['Final_B'], df['B'])).\
    withColumn('C', coalesce(df['Final_C'], df['C'])).\
    withColumn('D', coalesce(df['Final_D'], df['D']))
df1 = df.select('ID', 'Total_Count', 'A', 'B', 'C', 'D')
df1.show()

Pyspark dataframe conditional filter and imputation

I have a PySpark dataframe df:
ID  Total_Count  A   B   C   D   Group  Name        Chain
1   56           0   0   0   0   1      Apple       Fruits1
2   65           0   0   0   0   1      Apple       Fruits1
3   72           0   0   30  0   1      Banana      Fruits1
4   80           0   0   0   0   1      Strawberry  Fruits1
5   142          58  58  14  12  1      Apple       Fruits1
6   130          63  50  9   8   1      Apple       Fruits1
7   145          74  44  17  10  1      Apple       Fruits1
8   119          54  48  8   9   1      Apple       Fruits1
11  161          71  63  16  11  1      Banana      Fruits1
12  124          54  43  19  8   1      Banana      Fruits1
I want to impute the A, B, C, D columns wherever there is a 0 in them (IDs 1, 2, 3, 4).
1.) Logic: average of Group x Name (if available), else average of Group x Chain (if available), else average of Group.
Taking IDs 1 and 2 as the example to impute:
After filtering for Group 1 and Name Apple (i.e., for IDs 1 and 2 respectively, keeping the rows with the same Group (1) and the same Name (Apple)), the proportions are calculated as A/Total_Count, B/Total_Count, and so on:
A_PROP    B_PROP       C_PROP    D_PROP
0.408451  0.408450704  0.098592  0.084507042
0.484615  0.384615385  0.069231  0.061538462
0.510345  0.303448276  0.117241  0.068965517
0.453782  0.403361345  0.067227  0.075630252
2.) The average of the above 4 rows is then taken (for IDs 1 & 2, for example), and A, B, C, D in df2 are calculated as X_prop_avg * Total_Count. A sketch implementing this logic follows the expected output below.
Expected output (df2):
ID  Total_Count  A_prop_avg   B_prop_avg   C_prop_avg   D_prop_avg   A   B   C  D
1   56           0.46429811   0.37496893   0.08807265   0.07266032   26  21  5  4
2   65           0.464298107  0.374968927  0.088072647  0.072660318  30  24  6  5
3   72           0.43823883   0.369039271  0.126302344  0.066419555  32  27  9  5
4   80           0.455611681  0.372992375  0.10081588   0.070580064  36  30  8  6
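
A minimal sketch of this logic, assuming df holds the columns shown above: rows with no zeros in A-D act as donors, their proportions are averaged per Group x Name, falling back to Group x Chain and then Group, and the zero rows are recomputed from those averages. The is_complete flag and the window fallback chain are illustrative assumptions, not an established recipe:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

cols = ["A", "B", "C", "D"]

# a row can act as a "donor" only if none of A..D is 0
df = df.withColumn(
    "is_complete",
    (F.col("A") != 0) & (F.col("B") != 0) & (F.col("C") != 0) & (F.col("D") != 0),
)

w_name = Window.partitionBy("Group", "Name")
w_chain = Window.partitionBy("Group", "Chain")
w_group = Window.partitionBy("Group")

for c in cols:
    # the proportion is null for incomplete rows, so avg() only sees donor rows;
    # coalesce picks the first level that actually had donors
    prop = F.when(F.col("is_complete"), F.col(c) / F.col("Total_Count"))
    df = df.withColumn(
        c + "_prop_avg",
        F.coalesce(F.avg(prop).over(w_name), F.avg(prop).over(w_chain), F.avg(prop).over(w_group)),
    )

for c in cols:
    # recompute A..D for the rows that needed imputation
    df = df.withColumn(
        c,
        F.when(F.col("is_complete"), F.col(c)).otherwise(
            F.round(F.col(c + "_prop_avg") * F.col("Total_Count")).cast("int")
        ),
    )

df2 = df.filter(~F.col("is_complete")).select(
    "ID", "Total_Count", *[c + "_prop_avg" for c in cols], *cols
)

On the sample data this reproduces the expected table, e.g. for ID 1: 0.46429811 * 56 rounds to 26.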

Groupby with Pyspark through filters

I have a df derived from clustering that looks like this:
Cluster  Variable 1  Variable 2
0        334         32
0        0           45
3        453         0
3        320         0
0        0           28
1        467         49
3        324         16
1        58          2
And I'm trying to achieve the following result for each cluster and every variable:
Variable 1
Cluster  %of0  %ofvals != 0  Count of vals != 0  Sum of values  %universe
0        67    33            1                   334            17
1        0     100           2                   525            27
3        0     100           3                   1097           56
Variable 2
Cluster  %of0  %ofvals != 0  Count of vals != 0  Sum of values  %universe
0        0     100           0                   105            61
1        0     100           0                   51             29
3        67    33            1                   16             10
Note: % universe is based on the total sum of values for each variable; for Variable 1 that total is 334 + 525 + 1097 = 1956 (this is 100%, so 334 is 17% of it).
I'm in the process of learning PySpark and I'm struggling with the syntax. This is the code I'm trying, but I'm at a loss because I don't know how to manage the filtering to iterate over every variable and cluster:
for i in list_of_variables:
    print(i)
    df.groupBy('Cluster').agg(
        (count(col(i) == 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of 0'),
        (count(col(i) != 0) / df.filter(col('Cluster') == 0).count() * 100).alias('% of vals diff than 0'),
        ...)
I would be very grateful for any ideas that could shed light on how to accomplish this. Have an awesome day!
Maybe you could try something like this to obtain the counts part:

for i in list_of_variables:
    print(i)
    df.filter(col(i) != 0).groupBy(col('Cluster')).agg(
        count(col('*')).alias('Count_vals_dif_0')
    ).show()
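
Building on that, a fuller sketch of the whole summary table (assuming list_of_variables holds the variable column names, e.g. ['Variable 1', 'Variable 2']):

from pyspark.sql import functions as F

for i in list_of_variables:
    # universe: the total sum of this variable across all clusters (100%)
    total = df.agg(F.sum(F.col(i))).first()[0]
    print(i)
    df.groupBy('Cluster').agg(
        (F.avg((F.col(i) == 0).cast('int')) * 100).alias('% of 0'),
        (F.avg((F.col(i) != 0).cast('int')) * 100).alias('% of vals != 0'),
        F.sum((F.col(i) != 0).cast('int')).alias('Count of vals != 0'),
        F.sum(F.col(i)).alias('Sum of values'),
        (F.sum(F.col(i)) / total * 100).alias('% universe'),
    ).show()

The percentages come out unrounded (e.g. 66.67 rather than 67); wrap them in F.round if you want the integers shown above.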

How to extract details from an image using coordinates

import time
import cv2
import pytesseract
import numpy as np
import pdf2image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

x_axis = 2400
y_axis = 2700
pdf = pdf2image.convert_from_path(pdf_path='E:\\Rebecca\\VR115485 - 82520940 - NUCOR STEEL TUSCALOOSA - 400131.pdf',
                                  poppler_path='E:\\Ajai Krishna\\propeler\\poppler-0.68.0\\bin')
for _n in range(0, len(pdf)):
    try:
        img = pdf[_n].resize((x_axis, y_axis))
        bag_of_words = []
        clusters_coordinates = []
        img_np = np.zeros([100, 100])
        img_graph = cv2.resize(img_np, (x_axis, y_axis))
        img = np.asarray(img)
        # img = cv2.medianBlur(img, 5)
        text = str(pytesseract.image_to_string(img))
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        box = pytesseract.image_to_data(img_gray)
        print(box)
        label = "check no"
        for index, b in enumerate(box.splitlines()):
            if index != 0:
                b = b.split()
                if len(b) == 12:
                    x, y, w, h = int(b[6]), int(b[7]), int(b[8]), int(b[9])
    except:
        pass
From box I get this output:
5 1 2 1 2 6 1851 153 141 45 96.836044 Check
5 1 2 1 2 7 2018 156 56 44 96.992538 No
5 1 2 1 3 7 1852 220 167 43 92.319000 400131
5 1 2 1 2 2 301 155 112 43 38.483887 Name
5 1 2 1 3 3 300 220 141 43 57.061188 NOCOR
5 1 2 1 3 4 472 211 141 51 11.992462 STERI
5 1 2 1 3 5 640 183 307 135 58.077271 TUSCALOOSA
5 1 2 1 4 2 80 348 210 46 49.844437 VOUCHER
5 1 2 1 5 1 33 474 235 61 23.609283 ‘vRi15404
5 1 2 1 6 1 33 528 239 55 15.245552 “VRLI5485
5 1 3 1 1 1 51 605 222 42 38.442249 VR195486
5 1 2 1 4 3 293 315 263 78 4.121895 REFFRENCK
5 1 2 1 5 2 304 435 222 95 62.667671 82520840
5 1 2 1 6 2 304 540 222 43 89.974838 82520940
5 1 3 1 1 2 303 599 223 48 91.218178 82521040
Required solution:
Check No: 400131 ;;
VOUCHER : ‘vRi15404, “VRLI5485, VR195486 ;;
REFFRENCK: 82520840, 82520940, 82521040
Is there any solution to find the particular details based on the coordinates of the words using Python tesseract?
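
One possible direction (a sketch, not a definitive solution): since image_to_data also returns the coordinates of every word, you can bucket words into columns by their left coordinate and then read each labelled column off. The 60-pixel tolerance below is an illustrative assumption that would need tuning for your layout:

from pytesseract import Output

# parallel lists: data["text"][i] is the word, data["left"][i] its x coordinate
data = pytesseract.image_to_data(img_gray, output_type=Output.DICT)

columns = {}  # left-coordinate bucket -> words in that column, in reading order
for word, left in zip(data["text"], data["left"]):
    word = word.strip()
    if not word:
        continue
    # attach the word to an existing column whose x position is close enough,
    # otherwise start a new column (the 60 px tolerance is an assumption)
    nearest = min(columns, key=lambda k: abs(k - left), default=None)
    if nearest is not None and abs(nearest - left) < 60:
        columns[nearest].append(word)
    else:
        columns[left] = [word]

# each column should now start with its header word ("VOUCHER", "REFFRENCK", ...)
for x0, words in sorted(columns.items()):
    print(words[0], ":", ", ".join(words[1:]))

Multi-word headers such as "Check No" land in separate buckets here, so in practice you would merge adjacent header words or key the lookup on known labels.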

KDB: Apply logic where column exists - data validation

I'm trying to perform some simple logic on a table, but I'd like to verify that the columns exist beforehand as a validation step. My data uses standard column names, though they are not always present in each data source.
While the following seems to work (it just validates AAA at present), I need to expand it to ensure that PRI_AAA (and eventually many other variables) is present as well.
t: $[`AAA in cols `t; temp: update AAA_VAL: AAA*AAA_PRICE from t;()]
Two-part question:
This seems quite tedious for each variable (imagine AAA-ZZZ inputs and their derivatives). Is there a clever way to leverage a dictionary (or table) to see whether a number of variables exist, or to insert a placeholder column of zeros if they do not?
Similarly, can we store a formula or instructions to apply within a dictionary (or table) to validate and return a calculation (e.g. BBB_VAL: BBB*BBB_PRICE)? Some calculations would depend on others (e.g. BBB_Tax_Basis = BBB_VAL - BBB_COSTS), so there could be iterative issues.
Thanks in advance!
A functional update may be the best way to achieve this if your intention is to update many columns of a table in a similar fashion.
func:{[t;x]
  if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
  :$[x in cols t;
    ![t;();0b;(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")];
    t];
  };
This function will update t with *_VAL columns for any column you pass as an argument, while first also adding a zero column for any missing columns passed as an argument.
q)t:([]AAA:10?100;BBB:10?100;CCC:10?100;AAA_PRICE:10*10?10;BBB_PRICE:10*10?10;CCC_PRICE:10*10?10;DDD_PRICE:10*10?10)
q)func/[t;`AAA`BBB`CCC`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL CCC_VAL DDD DDD_VAL
---------------------------------------------------------------------------------------
70  28  89  10        90        0         0         700     2520    0       0   0
39  17  97  50        90        40        10        1950    1530    3880    0   0
76  11  11  0         0         50        10        0       0       550     0   0
26  55  99  20        60        80        90        520     3300    7920    0   0
91  51  3   30        20        0         60        2730    1020    0       0   0
83  81  7   70        60        40        90        5810    4860    280     0   0
76  68  98  40        80        90        70        3040    5440    8820    0   0
88  96  30  70        0         80        80        6160    0       2400    0   0
4   61  2   70        90        0         40        280     5490    0       0   0
56  70  15  0         50        30        30        0       3500    450     0   0
As you've already mentioned, to cover point 2, a dictionary of functions might be the best way to go.
q)dict:raze{(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")}each`AAA`BBB`DDD
q)dict
AAA_VAL| * `AAA `AAA_PRICE
BBB_VAL| * `BBB `BBB_PRICE
DDD_VAL| * `DDD `DDD_PRICE
And then a slightly modified function...
func:{[dict;t;x]
  if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
  :$[x in cols t;
    ![t;();0b;(enlist`$string[x],"_VAL")!enlist(dict`$string[x],"_VAL")];
    t];
  };
yields a similar result.
q)func[dict]/[t;`AAA`BBB`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL DDD DDD_VAL
-------------------------------------------------------------------------------
70  28  89  10        90        0         0         700     2520    0   0
39  17  97  50        90        40        10        1950    1530    0   0
76  11  11  0         0         50        10        0       0       0   0
26  55  99  20        60        80        90        520     3300    0   0
91  51  3   30        20        0         60        2730    1020    0   0
83  81  7   70        60        40        90        5810    4860    0   0
76  68  98  40        80        90        70        3040    5440    0   0
88  96  30  70        0         80        80        6160    0       0   0
4   61  2   70        90        0         40        280     5490    0   0
56  70  15  0         50        30        30        0       3500    0   0
Here's another approach which handles dependent/cascading calculations and also figures out which calculations are possible or not depending on the available columns in the table.
q)show map:`AAA_VAL`BBB_VAL`AAA_RevenueP`AAA_RevenueM`BBB_Other!((*;`AAA;`AAA_PRICE);(*;`BBB;`BBB_PRICE);(+;`AAA_Revenue;`AAA_VAL);(%;`AAA_RevenueP;1e6);(reciprocal;`BBB_VAL));
AAA_VAL | (*;`AAA;`AAA_PRICE)
BBB_VAL | (*;`BBB;`BBB_PRICE)
AAA_RevenueP| (+;`AAA_Revenue;`AAA_VAL)
AAA_RevenueM| (%;`AAA_RevenueP;1000000f)
BBB_Other | (%:;`BBB_VAL)
func:{c:{$[0h=type y;.z.s[x]each y;-11h<>type y;y;y in key x;.z.s[x]each x y;y]}[y]''[y];
![x;();0b;where[{all in[;cols x]r where -11h=type each r:(raze/)y}[x]each c]#c]};
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB AAA_VAL AAA_RevenueP AAA_RevenueM
---------------------------------------------------------------
1   1         10          4   1       11           1.1e-05
2   2         20          5   4       24           2.4e-05
3   3         30          6   9       39           3.9e-05
/if the right columns are there
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6;BBB_PRICE:4 5 6f);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB BBB_PRICE AAA_VAL BBB_VAL AAA_RevenueP AAA_RevenueM BBB_Other
-------------------------------------------------------------------------------------------
1   1         10          4   4         1       16      11           1.1e-05      0.0625
2   2         20          5   5         4       25      24           2.4e-05      0.04
3   3         30          6   6         9       36      39           3.9e-05      0.02777778
The only caveat is that your map can't have the same column name as both a key and within a value of the map, i.e., you cannot re-use column names. It's also assumed that all symbols in your map are column names (not global variables), though it could be extended to cover that.
EDIT: if you have a large number of column maps, then it will be easier to define the map in a more vertical fashion, like so:
map:(!). flip(
(`AAA_VAL; (*;`AAA;`AAA_PRICE));
(`BBB_VAL; (*;`BBB;`BBB_PRICE));
(`AAA_RevenueP;(+;`AAA_Revenue;`AAA_VAL));
(`AAA_RevenueM;(%;`AAA_RevenueP;1e6));
(`BBB_Other; (reciprocal;`BBB_VAL))
);