Summarizing unique values grouped by different fields in DAX Power BI

I have the following table - Tbl1:
ProgramID Domain ClientID DateIn DateOut StatusDays StatusDays_Upd
471 Res 323 09/13/2019 09/16/2019 4 4
471 Res 323 09/14/2019 09/16/2019 3 4
471 Res 323 09/15/2019 09/16/2019 2 4
471 Res 323 09/16/2019 09/16/2019 1 4
471 Res 325 08/12/2019 08/13/2019 2 2
471 Res 325 08/13/2019 08/13/2019 1 2
471 Res 318 10/10/2019 10/13/2019 4 4
471 Res 318 10/11/2019 10/13/2019 3 4
471 Res 318 10/12/2019 10/13/2019 2 4
471 Res 318 10/13/2019 10/13/2019 1 4
I need to create a measure that sums the values of [StatusDays_Upd], grouped by
[ProgramID], [Domain], [ClientID] and [DateOut], and that also counts the unique [ClientID] values,
so that the result looks like this:
ProgramID Count_ClientID Total_StatusDays_Upd
471 3 10
[Total_StatusDays_Upd] should be 4+2+4=10, that is, each distinct value per [ClientID] and [DateOut] group is counted once.
I used the following measure for [Count_ClientID]:
Count_ClientID = DISTINCTCOUNT(Tbl1[ClientID])
But I can't figure out the [Total_StatusDays_Upd] measure.

You can use SUMMARIZE in your measure to group by the required fields and then run the calculation over the result:
Total_StatusDays_Upd =
SUMX(
    SUMMARIZE(
        Tbl1,
        Tbl1[ProgramID],
        Tbl1[DateOut],
        Tbl1[ClientID],
        Tbl1[StatusDays_Upd]
    ),
    Tbl1[StatusDays_Upd]
)
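To make the grouping logic easier to check by hand, here is a minimal Python sketch of the same idea (not DAX, and not part of the measure itself): reduce the table to its distinct grouping combinations, then sum one value per combination. The function name summarize is made up for illustration, and the sketch groups by all four fields named in the question, including [Domain].

```python
# Rows of Tbl1 from the question:
# (ProgramID, Domain, ClientID, DateIn, DateOut, StatusDays, StatusDays_Upd)
rows = [
    (471, "Res", 323, "09/13/2019", "09/16/2019", 4, 4),
    (471, "Res", 323, "09/14/2019", "09/16/2019", 3, 4),
    (471, "Res", 323, "09/15/2019", "09/16/2019", 2, 4),
    (471, "Res", 323, "09/16/2019", "09/16/2019", 1, 4),
    (471, "Res", 325, "08/12/2019", "08/13/2019", 2, 2),
    (471, "Res", 325, "08/13/2019", "08/13/2019", 1, 2),
    (471, "Res", 318, "10/10/2019", "10/13/2019", 4, 4),
    (471, "Res", 318, "10/11/2019", "10/13/2019", 3, 4),
    (471, "Res", 318, "10/12/2019", "10/13/2019", 2, 4),
    (471, "Res", 318, "10/13/2019", "10/13/2019", 1, 4),
]


def summarize(rows):
    """Return {ProgramID: (distinct client count, total StatusDays_Upd)}."""
    # One entry per distinct combination of the grouping fields,
    # which is the role SUMMARIZE plays in the measure.
    distinct = {(p, d, c, out, upd) for p, d, c, _din, out, _days, upd in rows}
    result = {}
    for program_id in sorted({key[0] for key in distinct}):
        group = [key for key in distinct if key[0] == program_id]
        clients = {key[2] for key in group}
        total = sum(key[4] for key in group)  # the SUMX step
        result[program_id] = (len(clients), total)
    return result


print(summarize(rows))  # {471: (3, 10)}
```

The three distinct combinations carry 4, 2 and 4, which gives the expected total of 10.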


Pyspark dataframe conditional filter and imputation

I have a pyspark dataframe df
ID Total_Count A B C D Group Name Chain
1 56 0 0 0 0 1 Apple Fruits1
2 65 0 0 0 0 1 Apple Fruits1
3 72 0 0 30 0 1 Banana Fruits1
4 80 0 0 0 0 1 Strawberry Fruits1
5 142 58 58 14 12 1 Apple Fruits1
6 130 63 50 9 8 1 Apple Fruits1
7 145 74 44 17 10 1 Apple Fruits1
8 119 54 48 8 9 1 Apple Fruits1
11 161 71 63 16 11 1 Banana Fruits1
12 124 54 43 19 8 1 Banana Fruits1
I want to impute the A, B, C and D columns wherever they contain 0 (IDs 1, 2, 3 and 4).
1.) Logic: use the average of Group x Name (if available), else the average of Group x Chain (if available), else the average of Group.
Taking IDs 1 and 2 as an example:
After filtering for Group 1 and Name Apple (the rows with the same Group and Name as IDs 1 and 2), the proportions are calculated as A/Total_Count, B/Total_Count and so on:
A_PROP B_PROP C_PROP D_PROP
0.408451 0.408450704 0.098592 0.084507042
0.484615 0.384615385 0.069231 0.061538462
0.510345 0.303448276 0.117241 0.068965517
0.453782 0.403361345 0.067227 0.075630252
2.) The average of the above 4 rows is then taken (for IDs 1 and 2 in this example).
A, B, C and D in df2 are calculated as X_prop_avg*Total_Count.
Expected output (df2):
ID Total_Count A_prop_avg B_prop_avg C_prop_avg D_prop_avg A B C D
1 56 0.46429811 0.37496893 0.08807265 0.07266032 26 21 5 4
2 65 0.464298107 0.374968927 0.088072647 0.072660318 30 24 6 5
3 72 0.43823883 0.369039271 0.126302344 0.066419555 32 27 9 5
4 80 0.455611681 0.372992375 0.10081588 0.070580064 36 30 8 6
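The fallback logic described above can be sketched in plain Python to check the expected numbers before porting it to PySpark. This is only an illustration under assumptions that the question does not state outright: rows with any 0 in A to D are the ones to impute, only rows with no 0 act as donors for the averages, and the final counts are rounded to the nearest integer. The helper names (needs_imputation, mean_proportions, impute) are hypothetical.

```python
COLUMNS = ("ID", "Total_Count", "A", "B", "C", "D", "Group", "Name", "Chain")
PARTS = ("A", "B", "C", "D")

rows = [
    (1, 56, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (2, 65, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (3, 72, 0, 0, 30, 0, 1, "Banana", "Fruits1"),
    (4, 80, 0, 0, 0, 0, 1, "Strawberry", "Fruits1"),
    (5, 142, 58, 58, 14, 12, 1, "Apple", "Fruits1"),
    (6, 130, 63, 50, 9, 8, 1, "Apple", "Fruits1"),
    (7, 145, 74, 44, 17, 10, 1, "Apple", "Fruits1"),
    (8, 119, 54, 48, 8, 9, 1, "Apple", "Fruits1"),
    (11, 161, 71, 63, 16, 11, 1, "Banana", "Fruits1"),
    (12, 124, 54, 43, 19, 8, 1, "Banana", "Fruits1"),
]
records = [dict(zip(COLUMNS, row)) for row in rows]


def needs_imputation(rec):
    # Assumption: a row is imputed when any of A..D is 0.
    return any(rec[p] == 0 for p in PARTS)


# Assumption: only complete rows contribute to the averages.
donors = [r for r in records if not needs_imputation(r)]


def mean_proportions(keys, rec):
    """Average A..D proportions over donors matching rec on keys, or None."""
    matches = [d for d in donors if all(d[k] == rec[k] for k in keys)]
    if not matches:
        return None
    return {
        p: sum(d[p] / d["Total_Count"] for d in matches) / len(matches)
        for p in PARTS
    }


def impute(rec):
    """Try Group x Name, then Group x Chain, then Group alone."""
    props = None
    for keys in (("Group", "Name"), ("Group", "Chain"), ("Group",)):
        props = mean_proportions(keys, rec)
        if props is not None:
            break
    out = dict(rec)
    if props is None:
        return out  # no donor rows at any level, leave the row unchanged
    for p in PARTS:
        out[p + "_prop_avg"] = props[p]
        out[p] = round(props[p] * rec["Total_Count"])
    return out


df2 = [impute(r) for r in records if needs_imputation(r)]
for r in df2:
    print(r["ID"], r["Total_Count"], [r[p] for p in PARTS])
```

In PySpark the same three averages would typically come from grouped aggregations joined back onto the rows to impute, with the first non-null level chosen per row.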

How to extract the details from image using co-ordinates

import time
import cv2
import pytesseract
import numpy as np
import pdf2image
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
x_axis = 2400
y_axis = 2700
pdf = pdf2image.convert_from_path(pdf_path='E:\\Rebecca\\VR115485 - 82520940 - NUCOR STEEL TUSCALOOSA - 400131.pdf',poppler_path='E:\\Ajai Krishna\\propeler\\poppler-0.68.0\\bin')
for _n in range(0, len(pdf)):
    try:
        img = pdf[_n].resize((x_axis, y_axis))
        bag_of_words = []
        clusters_coordinates = []
        img_np = np.zeros([100, 100])
        img_graph = cv2.resize(img_np, (x_axis, y_axis))
        img = np.asarray(img)
        # #img = cv2.medianBlur(img, 5)
        text = str(pytesseract.image_to_string(img))
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        box = pytesseract.image_to_data(img_gray)
        print(box)
        label = "check no"
        for index, b in enumerate(box.splitlines()):
            if index != 0:  # skip the header row
                b = b.split()
                if len(b) == 12:  # only rows that carry a recognised word
                    x, y, w, h = int(b[6]), int(b[7]), int(b[8]), int(b[9])
    except:
        pass
From box I get the following output:
5 1 2 1 2 6 1851 153 141 45 96.836044 Check
5 1 2 1 2 7 2018 156 56 44 96.992538 No
5 1 2 1 3 7 1852 220 167 43 92.319000 400131
5 1 2 1 2 2 301 155 112 43 38.483887 Name
5 1 2 1 3 3 300 220 141 43 57.061188 NOCOR
5 1 2 1 3 4 472 211 141 51 11.992462 STERI
5 1 2 1 3 5 640 183 307 135 58.077271 TUSCALOOSA
5 1 2 1 4 2 80 348 210 46 49.844437 VOUCHER
5 1 2 1 5 1 33 474 235 61 23.609283 ‘vRi15404
5 1 2 1 6 1 33 528 239 55 15.245552 “VRLI5485
5 1 3 1 1 1 51 605 222 42 38.442249 VR195486
5 1 2 1 4 3 293 315 263 78 4.121895 REFFRENCK
5 1 2 1 5 2 304 435 222 95 62.667671 82520840
5 1 2 1 6 2 304 540 222 43 89.974838 82520940
5 1 3 1 1 2 303 599 223 48 91.218178 82521040
Required output:
Check No: 400131 ;;
VOUCHER : ‘vRi15404, “VRLI5485, VR195486 ;;
REFFRENCK: 82520840, 82520940, 82521040
Is there a way to extract these particular details based on the coordinates of the words, using Python and Tesseract?
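One possible starting point, sketched here under a layout assumption rather than offered as a definitive method: treat a label word as an anchor and collect every word that lies below it and overlaps it horizontally. The rows are the image_to_data output from the question; parse_words and values_below are hypothetical helper names, and the overlap rule would need tuning for other documents.

```python
# image_to_data rows from the question (whitespace-separated):
# level page block par line word left top width height conf text
SAMPLE = """\
5 1 2 1 2 6 1851 153 141 45 96.836044 Check
5 1 2 1 2 7 2018 156 56 44 96.992538 No
5 1 2 1 3 7 1852 220 167 43 92.319000 400131
5 1 2 1 2 2 301 155 112 43 38.483887 Name
5 1 2 1 3 3 300 220 141 43 57.061188 NOCOR
5 1 2 1 3 4 472 211 141 51 11.992462 STERI
5 1 2 1 3 5 640 183 307 135 58.077271 TUSCALOOSA
5 1 2 1 4 2 80 348 210 46 49.844437 VOUCHER
5 1 2 1 5 1 33 474 235 61 23.609283 ‘vRi15404
5 1 2 1 6 1 33 528 239 55 15.245552 “VRLI5485
5 1 3 1 1 1 51 605 222 42 38.442249 VR195486
5 1 2 1 4 3 293 315 263 78 4.121895 REFFRENCK
5 1 2 1 5 2 304 435 222 95 62.667671 82520840
5 1 2 1 6 2 304 540 222 43 89.974838 82520940
5 1 3 1 1 2 303 599 223 48 91.218178 82521040
"""


def parse_words(data):
    """Turn image_to_data text into a list of word boxes."""
    words = []
    for line in data.splitlines():
        parts = line.split()
        # Skip the header row and rows without a recognised word.
        if len(parts) != 12 or not parts[6].isdigit():
            continue
        left, top, width, _height = (int(v) for v in parts[6:10])
        words.append(
            {"text": parts[11], "left": left, "top": top, "right": left + width}
        )
    return words


def values_below(words, label):
    """Texts of the words below `label` that overlap it horizontally."""
    anchor = next(w for w in words if w["text"] == label)
    below = [
        w
        for w in words
        if w["top"] > anchor["top"]
        and w["left"] < anchor["right"]
        and w["right"] > anchor["left"]
    ]
    return [w["text"] for w in sorted(below, key=lambda w: w["top"])]


words = parse_words(SAMPLE)
for label in ("Check", "VOUCHER", "REFFRENCK"):
    print(label, ":", ", ".join(values_below(words, label)))
```

The rule works on this sample because each label sits directly above its values. A label such as Name, which has several unrelated columns beneath it, would also need a limit on the vertical distance.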

Plot selected rows with the average and standard deviation (GNUPlot)

I have a csv file with experiment results that goes like this:
64 4 8 1 1 2 1 ttt 62391 4055430 333 0.0001 10 161 108 288 0
64 4 8 1 1 2 1 ttt 60966 3962810 322 0.0001 10 164 112 295 0
64 4 8 1 1 2 1 ttt 61530 3999475 325 0.0001 10 162 112 291 0
64 4 8 1 1 2 1 ttt 61430 4054428 332 0.0001 10 158 110 286 0
64 4 8 1 1 2 1 ttt 63891 4152938 339 0.0001 9 149 109 274 0
64 4 32 1 1 2 1 ttt 63699 4204182 345 0.0001 4 43 179 240 0
64 4 32 1 1 2 1 ttt 63326 4116218 336 0.0001 4 45 183 248 0
64 4 32 1 1 2 1 ttt 62654 4135211 340 0.0001 4 48 178 248 0
64 4 32 1 1 2 1 ttt 63192 4107506 339 0.0001 4 49 175 245 0
64 4 32 1 1 2 1 ttt 62707 4138666 345 0.0001 4 46 179 245 0
64 4 64 1 1 2 1 ttt 60968 3962929 323 0.0001 4 46 191 256 0
64 4 64 1 1 2 1 ttt 58765 3819787 305 0.0001 4 50 196 267 0
64 4 64 1 1 2 1 ttt 58946 3831499 308 0.0001 5 52 187 260 0
64 4 64 1 1 2 1 ttt 60646 3942047 321 0.0001 4 47 187 254 0
64 4 64 1 1 2 1 ttt 59723 3882044 311 0.0001 4 46 201 269 0
64 8 8 1 1 2 1 ttt 63414 4185382 382 0.0001 33 517 109 643 0
64 8 8 1 1 2 1 ttt 62429 4057899 372 0.0001 33 538 110 667 0
64 8 8 1 1 2 1 ttt 60622 3940452 384 0.0001 33 556 115 689 0
64 8 8 1 1 2 1 ttt 64433 4188192 369 0.0001 33 519 110 644 0
My goal is to be able to plot various combinations (choose which, in different charts) of the columns before the "ttt", with the average and standard deviation of the columns (choose which) after "ttt" (by grouping them by the before "ttt" columns).
Is this possible in GNUPlot and if yes how? If not, do you have any alternate suggestions regarding my problem?
Here is a completely revised and more general version.
Since you want to filter by 3 columns you need to have 3 properties to distinguish the data in the plot. This would be for example color, x-position and pointtype. What the script basically does:
Generates random data for testing (take your file instead)
$Data looks like this:
8 64 57773 0
4 32 64721 2
8 32 56757 1
4 16 56226 2
8 8 56055 1
8 64 59874 0
8 32 58733 0
4 16 55525 2
8 32 58869 0
8 64 64470 0
4 32 60930 1
8 64 57073 2
...
the variables ColX, ColC, ColP, and ColS define which columns are taken for x-position, color, pointtype and statistics.
find unique values of ColX, ColC, ColP, (check help smooth frequency) and put them to datablocks $ColX, $ColC, and $ColP.
put the unique values to arrays ArrX, ArrC, ArrP
loop all possible combinations and do statistics on ColS and put it to $Data2. Add 3 columns at the beginning for color, x-position and pointtype.
$Data2 looks like this:
1 1 1 0 8 4 61639.4 2788.4
1 1 2 0 8 8 59282.1 2740.2
1 2 1 0 16 4 59372.3 2808.6
1 2 2 0 16 8 60502.3 2825.0
1 3 1 0 32 4 59850.7 2603.8
1 3 2 0 32 8 60617.7 1979.8
1 4 1 0 64 4 60399.4 3273.6
1 4 2 0 64 8 59930.7 2919.8
2 1 1 1 8 4 59172.6 2288.2
2 1 2 1 8 8 58992.2 2888.0
2 2 1 1 16 4 59350.1 2364.6
2 2 2 1 16 8 61034.0 2368.5
2 3 1 1 32 4 59920.8 2867.6
2 3 2 1 32 8 59711.9 3464.2
2 4 1 1 64 4 60936.7 3439.7
2 4 2 1 64 8 61078.7 2349.3
3 1 1 2 8 4 58976.0 2376.3
3 1 2 2 8 8 61731.5 1635.7
3 2 1 2 16 4 58276.0 2101.7
3 2 2 2 16 8 58594.5 3358.5
3 3 1 2 32 4 60471.5 3737.6
3 3 2 2 32 8 59909.1 2024.0
3 4 1 2 64 4 62044.2 1446.7
3 4 2 2 64 8 60454.0 3215.1
Finally, plot the data. I couldn't figure out how the plotting style with yerror works together with variable pointtypes, so instead I split it into two plot commands, with vectors and with points. The third one, keyentry, is just to get an empty line in the legend, and the fourth one is to get the pointtype into the legend.
I hope you can figure out all the other details and adapt it to your data.
Code:
### grouped statistics on filtered (unsorted) data
reset session
set colorsequence classic
# generate some random test data
rand1(n) = 2**(int(rand(0)*2)+2) # values 4,8
rand2(n) = 2**(int(rand(0)*4)+3) # values 8,16,32,64
rand3(n) = int(rand(0)*10000)+55000 # values 55000 to 65000
rand4(n) = int(rand(0)*3) # values 0,1,2
set print $Data
do for [i=1:200] {
print sprintf("% 3d% 4d% 7d% 3d", rand1(0), rand2(0), rand3(0), rand4(0))
}
set print
print $Data # (just for test purpose)
ColX = 2 # column for x
ColC = 4 # column for color
ColP = 1 # column for pointtype
ColS = 3 # column for statistics
# get unique values of the columns
set table $ColX
plot $Data u (column(ColX)) smooth freq
unset table
set table $ColC
plot $Data u (column(ColC)) smooth freq
unset table
set table $ColP
plot $Data u (column(ColP)) smooth freq
unset table
# put unique values into arrays
set table $Dummy
array ArrX[|$ColX|-6] # gnuplot creates 6 extra lines
array ArrC[|$ColC|-6]
array ArrP[|$ColP|-6]
plot $ColX u (ArrX[$0+1]=$1)
plot $ColC u (ArrC[$0+1]=$1)
plot $ColP u (ArrP[$0+1]=$1)
unset table
print ArrX, ArrC, ArrP # just for test purpose
# define filter function
Filter(c,x,p) = ArrX[x]==column(ColX) && ArrC[c]==column(ColC) && \
ArrP[p]==column(ColP) ? column(ColS) : NaN
# loop all values and do statistics, write data into $Data2
set print $Data2
do for [c=1:|ArrC|] {
do for [x=1:|ArrX|] {
do for [p=1:|ArrP|] {
undef var STATS*
stats $Data u (Filter(c,x,p)) nooutput
if (exists('STATS_mean') && exists('STATS_stddev')) {
print sprintf("% 3d% 3d% 3d% 3d% 3d% 3d% 9.1f % 7.1f", c, x, p, ArrC[c], ArrX[x], ArrP[p], STATS_mean, STATS_stddev)
}
}
}
print ""; print ""
}
set print
# print $Data2 # just for testing purpose
set xlabel sprintf("Column %d", ColX)
set ylabel sprintf("Column %d", ColS)
set xrange[0.5:|ArrX|+1]
set xtics () # remove all xtics
do for [x=1:|ArrX|] { set xtics add (sprintf("%d",ArrX[x]) x)} # set xtics "manually"
# function for x position and offsets,
# actually not dependent on 'n' but to shorten plot command
# columns in $Data2: 1=color, 2=x, 3=pointtype
width = 0.5 # float number!
step = width/(|ArrC|-1)
PosX(n) = column(2) - width/2.0 + step*(column(1)-1) + (column(3)-1)*step*0.3
plot \
for [c=1:|ArrC|] $Data2 u (PosX(0)):($7-$8):(0):(2*$8) index c-1 w vectors \
heads size 0.04,90 lw 2 lc c ti sprintf("%g",ArrC[c]),\
for [c=1:|ArrC|] '' u (PosX(0)):7:($3*2+4):(c) index c-1 w p ps 1.5 pt var lc var not, \
keyentry w p ps 0 ti "\n", \
for [p=1:|ArrP|] '' u (0):(NaN) w p pt p*2+4 ps 1.5 lc rgb "black" ti sprintf("%g",ArrP[p])
### end of code
I do not think gnuplot can produce exactly what you are asking for in a single plot command. I will show you two alternatives in the hope that one or both is a useful starting point.
Alternative 1: standard boxplot
spacing = 1.0
width = 0.25
unset key
set xlabel "Column 3"
set ylabel "Column 9"
plot 'data' using (spacing):9:(width):3 with boxplot lw 2
This collects points based on the value in column 3 and for each such value it produces a boxplot. This is a widely used method of showing the distribution of point values in a category, but it is a quartile analysis not a display of mean + standard deviation.
Alternative 2: calculate mean and standard deviation for categories known in advance
The boxplot analysis has the advantage that you do not need to know in advance what values may be present in column 3. Gnuplot can calculate mean and standard deviation based on a column 3 value, but you need to specify in advance what that value is. Here is a set of commands tailored to the specific example file you provided. It calculates, but does not plot, the requested categorical mean and standard deviation. You can use these numbers to construct a plot, but that will require additional commands. You could, for example, save the values for each category in a new file, or array, or datablock and then go back and plot these together.
col3entry = "8 32 64"
do for [i in col3entry] {
stats "data" using ($3 == real(i) ? $9 : NaN) name "Condition".i nooutput
print i, ": ", value("Condition".i."_mean"), value("Condition".i."_stddev")
}
output:
8: 62345.1111111111 1259.34784220021
32: 63115.6 392.552977316438
64: 59809.6 881.583711283279
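As a non-gnuplot alternative, the same per-category numbers can be computed with a short Python script and written to a file for plotting. This is a minimal sketch using only the standard library; grouped_stats is a made-up helper name, columns are counted from 1 as in gnuplot, and only the rows with 32 and 64 in column 3 are included as sample data. The population standard deviation (pstdev) is used because it reproduces the values printed above.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Rows from the question with 32 and 64 in column 3.
SAMPLE = """\
64 4 32 1 1 2 1 ttt 63699 4204182 345 0.0001 4 43 179 240 0
64 4 32 1 1 2 1 ttt 63326 4116218 336 0.0001 4 45 183 248 0
64 4 32 1 1 2 1 ttt 62654 4135211 340 0.0001 4 48 178 248 0
64 4 32 1 1 2 1 ttt 63192 4107506 339 0.0001 4 49 175 245 0
64 4 32 1 1 2 1 ttt 62707 4138666 345 0.0001 4 46 179 245 0
64 4 64 1 1 2 1 ttt 60968 3962929 323 0.0001 4 46 191 256 0
64 4 64 1 1 2 1 ttt 58765 3819787 305 0.0001 4 50 196 267 0
64 4 64 1 1 2 1 ttt 58946 3831499 308 0.0001 5 52 187 260 0
64 4 64 1 1 2 1 ttt 60646 3942047 321 0.0001 4 47 187 254 0
64 4 64 1 1 2 1 ttt 59723 3882044 311 0.0001 4 46 201 269 0
"""


def grouped_stats(text, key_col, value_col):
    """Map each key_col value to (mean, population std dev) of value_col.

    Column numbers are 1-based, matching gnuplot's $3, $9 and so on.
    """
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        groups[fields[key_col - 1]].append(float(fields[value_col - 1]))
    return {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}


stats = grouped_stats(SAMPLE, key_col=3, value_col=9)
for key, (avg, sd) in stats.items():
    print(key, avg, sd)
```

To group by several of the columns before "ttt", the key could be a tuple of fields instead of a single one.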

Taking two separate tibbles(e.g. [[1]] and [[2]]) in data and merge?

I have a multi-step problem:
1) I need to turn [[1]] "Total" -> Losing_Runs
2) I need to turn [[2]] "Total" -> Winning_Runs
3) I need to take tibble[[1]] "Loser" column and merge up with tibble[[2]] "Winner" column under a new column name labeled "Team"
4) The new tibble should be a 30x3 when compiled. The new column variables should be "Team", "Winning_Runs", "Losing_Runs"
The R output is below:
[[1]]
# A tibble: 30 x 2
Loser Total
<chr> <dbl>
1 Baltimore Orioles 288
2 Kansas City Royals 278
3 Chicago White Sox 252
4 Minnesota Twins 251
5 Texas Rangers 236
6 Detroit Tigers 233
7 Miami Marlins 228
8 Cincinnati Reds 224
9 Pittsburgh Pirates 217
10 San Diego Padres 212
# ... with 20 more rows
[[2]]
# A tibble: 30 x 2
Winner Total
<chr> <dbl>
1 Boston Red Sox 694
2 Houston Astros 627
3 New York Yankees 579
4 Cleveland Indians 577
5 Chicago Cubs 572
6 Oakland Athletics 571
7 Los Angeles Dodgers 568
8 Atlanta Braves 543
9 Washington Nationals 540
10 Milwaukee Brewers 528
# ... with 20 more rows
Thank you very much for any and all help!

Embedding an array into another

I have two arrays. The first one is a consecutive sequential one, like:
seq1 =
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
...continues
The second one is like:
seq2 =
2 250
3 260
5 267
6 270
8 280
10 290
13 300
18 310
20 320
21 330
...continues
I need to embed seq2 into seq1 in such a way that I end up with the sequence:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
... continues
I could do this with loops but the arrays are really big so I don't want to use two for loops, it is taking too long. How can I do this in a vectorised manner?
I think this does what you want:
[~, jj, vv] = find(sum(bsxfun(@le, seq2(:,1), seq1(:,1).'), 1));
seq3 = seq1;
seq3(jj,2) = seq2(vv,2);
How it works
The required index is obtained by computing how many values in the first column of seq2 are less than or equal to each value in the first column of seq1 (code sum(bsxfun(@le, ...), 1)). This will be used to select the appropriate entries from the second column of seq2 and write them into the result. But before that, the value 0 in this index needs to be discarded. This is done using the three-output version of find (code [~, jj, vv] = find(...)).
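For readers who want to check the counting idea outside MATLAB, here is a small Python sketch of the same logic (an illustration, not a translation of the code above). It assumes the first column of seq2 is sorted in ascending order, so that bisect_right returns the number of seq2 keys less than or equal to each seq1 key. The function name embed is made up.

```python
from bisect import bisect_right

seq1_keys = list(range(1, 12))  # first column of seq1: 1..11
seq2 = [(2, 250), (3, 260), (5, 267), (6, 270), (8, 280),
        (10, 290), (13, 300), (18, 310), (20, 320), (21, 330)]


def embed(seq1_keys, seq2):
    """For each seq1 key, take the value of the last seq2 key <= that key."""
    keys2 = [key for key, _ in seq2]  # assumed sorted ascending
    result = []
    for key in seq1_keys:
        count = bisect_right(keys2, key)  # how many seq2 keys are <= key
        # A count of 0 means no seq2 entry applies yet, so keep the 0.
        result.append((key, seq2[count - 1][1] if count else 0))
    return result


for key, value in embed(seq1_keys, seq2):
    print(key, value)
```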
If your second column of data is always increasing, you can solve this easily with accumarray and cummax:
seq = [seq1; seq2];
seq3 = cummax(accumarray(seq(:, 1), seq(:, 2), [], @max));
seq3 = [(1:numel(seq3)).' seq3];
And here's what you get for your sample inputs:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
12 290
13 300
14 300
15 300
16 300
17 300
18 310
19 310
20 320
21 330
How it works...
After concatenating seq1 and seq2, accumarray collects all the values in the second column that have the same value in the first column (i.e. [0 250] for the value 2), then gets the maximum value of each set. The function cummax is then used to fill any zero values with the previous non-zero value. Finally, an index column is added to the new sequence.
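The same two steps can be sketched in Python to see why the result only holds when the second column never decreases. This is an illustration with a made-up function name (embed_cummax): a per-index maximum stands in for accumarray with @max, and itertools.accumulate with max stands in for cummax.

```python
from itertools import accumulate

seq2 = [(2, 250), (3, 260), (5, 267), (6, 270), (8, 280),
        (10, 290), (13, 300), (18, 310), (20, 320), (21, 330)]


def embed_cummax(n, seq2):
    """Fill indices 1..n with the running maximum of the seq2 values."""
    values = [0] * n  # seq1 contributes a 0 at every index
    for key, value in seq2:
        if 1 <= key <= n:
            # Per-index maximum, the role of accumarray(..., @max).
            values[key - 1] = max(values[key - 1], value)
    filled = list(accumulate(values, max))  # running maximum, like cummax
    return list(enumerate(filled, start=1))  # add the index column back


seq3 = embed_cummax(21, seq2)
for key, value in seq3:
    print(key, value)
```

If a later seq2 value were smaller than an earlier one, the running maximum would carry the earlier value forward instead, which is why the increasing-values condition matters.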