How to build a chain of segments in a PySpark dataframe

I have a huge PySpark dataframe with segments and their subsegments, like this:
SegmentId  SubSegmentStart  SubSegmentEnd
1          a1               a2
1          a2               a3
2          b1               b2
3          c1               c2
3          c3               c4
3          c2               c3
I need to group the records by SegmentId and add a new Index column that orders the subsegments into a chain, matching each SubSegmentEnd to the SubSegmentStart that follows it. I need to do this for each segment.
So I need to get the following dataframe:
SegmentId  SubSegmentStart  SubSegmentEnd  Index
1          a1               a2             0
1          a2               a3             1
2          b1               b2             0
3          c1               c2             0
3          c3               c4             2
3          c2               c3             1
How can I do this in PySpark?
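
One way to build the Index without collecting anything to the driver is to find the head of each chain and then walk the links with iterative self-joins. A minimal sketch, assuming every segment forms a single unbroken, acyclic chain (the variable names and the loop bound are my own):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a1", "a2"), (1, "a2", "a3"),
     (2, "b1", "b2"),
     (3, "c1", "c2"), (3, "c3", "c4"), (3, "c2", "c3")],
    ["SegmentId", "SubSegmentStart", "SubSegmentEnd"],
)

# Head of each chain: the row whose start point never appears
# as an end point within the same segment.
heads = (
    df.alias("s")
    .join(
        df.alias("e"),
        (F.col("s.SegmentId") == F.col("e.SegmentId"))
        & (F.col("s.SubSegmentStart") == F.col("e.SubSegmentEnd")),
        "left_anti",
    )
    .withColumn("Index", F.lit(0))
)

# Walk the chains: each pass joins the current frontier to the row
# whose start matches its end, incrementing Index. The loop is
# bounded by the longest possible chain (max rows per SegmentId).
longest = df.groupBy("SegmentId").count().agg(F.max("count")).first()[0]
result, frontier = heads, heads
for _ in range(longest - 1):
    frontier = (
        frontier.alias("cur")
        .join(
            df.alias("nxt"),
            (F.col("cur.SegmentId") == F.col("nxt.SegmentId"))
            & (F.col("cur.SubSegmentEnd") == F.col("nxt.SubSegmentStart")),
        )
        .select(
            "nxt.SegmentId", "nxt.SubSegmentStart", "nxt.SubSegmentEnd",
            (F.col("cur.Index") + 1).alias("Index"),
        )
    )
    result = result.unionByName(frontier)

result.orderBy("SegmentId", "Index").show()

The number of join passes equals the length of the longest chain, so it pays to cache frontier between iterations; for very long chains a graph tool such as GraphFrames may scale better.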

Related

Laravel Eloquent: Join two tables

I have a problem with an Eloquent statement.
I have two tables:
Table 1 (ORDER_DETAILS)

FORM_ID  ORDER_ID  DETAIL_ID  VALUE
A1       X1        B1         Test
A1       X1        B3         10;20
A2       X2        B10        Test 2
A2       X2        B20        1;2
A2       X2        B30        A;B;C
A3       X3        B200       X;Y;Z
A3       X3        B300       Test 3
A4       X4        B1000      L;M;O
Table 2 (FORM_DETAIL)

FORM_ID  DETAIL_ID  MOD
A1       B1         Text
A1       B2         Input
A1       B3         Select
A2       B10        Input
A2       B20        Select
A2       B30        Select
A3       B100       Text
A3       B200       Select
A3       B300       Input
A4       B1000      Select
A4       B2000      Text
A4       B3000      Text
A4       B4000      Input
A4       B5000      Input
Now I would like to bring them together, for example when I query ORDER_ID X1:
FORM_ID  ORDER_ID  DETAIL_ID  VALUE
A1       X1        B1         Test
A1       X1        B2         null
A1       X1        B3         10;20
or ORDER_ID X4:
FORM_ID  ORDER_ID  DETAIL_ID  VALUE
A4       X4        B1000      L;M;O
A4       X4        B2000      null
A4       X4        B3000      null
A4       X4        B4000      null
A4       X4        B5000      null
All rows from table 2 (FORM_DETAIL) should always be displayed, together with a check of whether a VALUE exists in ORDER_DETAILS for that ORDER_ID (e.g. X4). It is important that FORM_ID and ORDER_ID always belong together, i.e. a given ORDER_ID always has the same FORM_ID.
Can you maybe help me?
Sorry, I am still Laravel beginner :o)
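
Conceptually this is a LEFT JOIN from FORM_DETAIL to ORDER_DETAILS on FORM_ID and DETAIL_ID, with the ORDER_ID condition kept in the join clause; in Laravel you would express it with the query builder's leftJoin and a closure for the extra conditions. A minimal sketch of the join shape in plain SQL, run through Python's sqlite3 so it is self-contained; the lower-cased table names and the hard-coded 'A1'/'X1' are assumptions for the demo:

import sqlite3

# Build the two tables from the question (trimmed to form A1 / order X1).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE order_details (form_id, order_id, detail_id, value);
    CREATE TABLE form_detail   (form_id, detail_id, mod);
    INSERT INTO order_details VALUES
        ('A1', 'X1', 'B1', 'Test'),
        ('A1', 'X1', 'B3', '10;20');
    INSERT INTO form_detail VALUES
        ('A1', 'B1', 'Text'),
        ('A1', 'B2', 'Input'),
        ('A1', 'B3', 'Select');
""")

# LEFT JOIN keeps every FORM_DETAIL row; VALUE is NULL where the order
# has no matching detail. The order_id condition lives in the ON clause
# so it does not turn the left join back into an inner join.
rows = con.execute("""
    SELECT fd.form_id, 'X1' AS order_id, fd.detail_id, od.value
    FROM form_detail AS fd
    LEFT JOIN order_details AS od
           ON od.form_id   = fd.form_id
          AND od.detail_id = fd.detail_id
          AND od.order_id  = 'X1'
    WHERE fd.form_id = 'A1'
""").fetchall()

for row in rows:
    print(row)
# ('A1', 'X1', 'B1', 'Test')
# ('A1', 'X1', 'B2', None)
# ('A1', 'X1', 'B3', '10;20')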

Get additional column using functional select

How can I get an additional column of type string using functional select (?)?
I tried this:
t:([]c1:`a`b`c;c2:1 2 3)
?[t;();0b;`c1`c2`c3!(`c1;`c2;10)] / ok
?[t;();0b;`c1`c2`c3!(`c1;`c2;enlist(`abc))] / ok
?[t;();0b;`c1`c2`c3!(`c1;`c2;"10")] / 'length
?[t;();0b;`c1`c2`c3!(`c1;`c2;enlist("10"))] / 'length
but got 'length error.
Your first case works because an atom will automatically expand to the required length. For a compound column you'll need to explicitly generate the correct length, as follows:
q)select c1,c2,c3:`abc,c4:10,c5:count[i]#enlist"abc" from t
c1 c2 c3 c4 c5
------------------
a 1 abc 10 "abc"
b 2 abc 10 "abc"
c 3 abc 10 "abc"
// in functional form
q)?[t;();0b;`c1`c2`c3!(`c1;`c2;(#;(count;`i);(enlist;"abc")))]
c1 c2 c3
-----------
a 1 "abc"
b 2 "abc"
c 3 "abc"
Jason

How to create a customised user-defined aggregate distinct function

I have a dataframe which contains 4 columns.
Dataframe sample
id1  id2  id3  id4
------------------
a1   a2   a3   a4
b1   b2   b3   b4
b1   b2   b3   b4
c1   c2   c3   c4
     b2
c1
          a3
               a4
c1
               d4
There are two types of rows: either all the columns have data, or only one column does.
I want to perform a distinct across all the columns such that, when comparing rows, only the values actually present in a row are compared and the null values are ignored.
Output dataframe should be
id1  id2  id3  id4
a1   a2   a3   a4
b1   b2   b3   b4
c1   c2   c3   c4
               d4
I have looked at multiple examples of UDAFs in Spark but have not been able to modify them accordingly.
You can use filter on all the columns as below:
df.filter($"id1" =!= "" && $"id2" =!= "" && $"id3" =!= "" && $"id4" =!= "")
and you should get your final dataframe.
The above code is for a static four-column dataframe. If you have more than four columns, the above method becomes hectic, as you would have to write too many logic checks.
The solution to that is to use a udf function, as below:
import scala.collection.mutable
import org.apache.spark.sql.functions._

// true only when no element of the row is null or empty
def checkIfNull = udf((co: mutable.WrappedArray[String]) => !(co.contains(null) || co.contains("")))
df.filter(checkIfNull(array(df.columns.map(col): _*))).show(false)
I hope the answer is helpful
It is possible to take advantage of the fact that dropDuplicates is order-dependent to solve this; see the answer here. However, it is not very efficient; there should be a more efficient solution.
First remove all duplicates with distinct(), then iteratively order by each column and drop its duplicates. The columns are ordered in descending order, as nulls will then be put last.
Example with four static columns:
val df2 = df.distinct()
.orderBy($"id1".desc).dropDuplicates("id1")
.orderBy($"id2".desc).dropDuplicates("id2")
.orderBy($"id3".desc).dropDuplicates("id3")
.orderBy($"id4".desc).dropDuplicates("id4")

How to parse empty columns getting displaced within TSV?

I have a TSV file that I want to parse. There are empty fields in all columns, which displaces the order of the columns, so that not all the values I get by a given column number actually come from that column.
Some fields contain long strings with spaces inside them. Also, some columns contain potential delimiters like ; | :
Input file
columnA columnB columnC columnD
A1      B1      C1      D1
B2      C2      D2
A3      D3
A4      B4      D4
Desired output
columnA  columnB  columnC  columnD
A1       B1       C1       D1
         B2       C2       D2
A3                         D3
A4       B4                D4
$ file myfile
ASCII English text, with very long lines
$ awk -F'\t' '{print NF}' myfile | sort | uniq -c | tail -n 1
 247871 136
I have found this code posted in reply to a similar question (https://unix.stackexchange.com/questions/29023/how-to-display-tsv-csv-in-console-when-empty-cells-are-missed-by-column-t), but I cannot make this work for my file:
sed ':x s/\(^\|\t\)\t/\1 \t/; t x' < file.tsv | column -t -s $'\t'
(The problem persists after importing into Excel.)
FieldEmpty='           '    # eleven spaces: the display width of one column
FieldSize=${#FieldEmpty}
sed "
s/A/&/
t B
s/^ */ ${FieldEmpty}/
t B
: B
s/B/&/
t C
s/^ .\{${FieldSize}\}/&${FieldEmpty}/
t C
: C
s/C/&/
t D
s/^ \(.\{${FieldSize}\}\)\{2\}/&${FieldEmpty}/
t D
: D
s/D/&/
t
s/^ \(.\{${FieldSize}\}\)\{3\}/&${FieldEmpty}/
" YourFile
If more columns are used, an iterative approach should be applied instead (same test/insert concept).
This was done on AIX/ksh (so it should behave the same as GNU sed with --posix).
$ cat YourFile
columnA columnB columnC columnD
A1      B1      C1      D1
B2      C2      D2
A3      D3
A4      B4      D4
$ FieldEmpty='           '; FieldSize=${#FieldEmpty}; echo $FieldSize
11
$ sed "..." YourFile
columnA     columnB     columnC     columnD
A1          B1          C1          D1
            B2          C2          D2
A3                                  D3
A4          B4                      D4
If your file is tab separated, you should use tab as the field separator in awk. Note that column -t collapses the empty fields:
$ column -t -s $'\t' file
columnA  columnB  columnC  columnD
A1 1     B1 2     C1 3     D1 4
B2 2     C2 4     D2 4
A3 1     D3 4
A4 1     B4 2     D4 4
$ xxd file
0000000: 636f 6c75 6d6e 4109 636f 6c75 6d6e 4209 columnA.columnB.
0000010: 636f 6c75 6d6e 4309 636f 6c75 6d6e 440a columnC.columnD.
0000020: 4131 2031 0942 3120 3209 4331 2033 0944 A1 1.B1 2.C1 3.D
0000030: 3120 340a 0942 3220 3209 4332 2034 0944 1 4..B2 2.C2 4.D
0000040: 3220 340a 4133 2031 0909 0944 3320 340a 2 4.A3 1...D3 4.
0000050: 4134 2031 0942 3420 3209 0944 3420 340a A4 1.B4 2..D4 4.
$ awk -F'\t' '{
    for (i=1; i<=NF; i++) {
        printf "%-8s ", $i
    }
    print ""
}' file
columnA  columnB  columnC  columnD
A1 1     B1 2     C1 3     D1 4
         B2 2     C2 4     D2 4
A3 1                       D3 4
A4 1     B4 2              D4 4
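
If the goal is to parse the data rather than only pretty-print it, splitting strictly on tab characters keeps empty fields in their columns. A minimal Python sketch under that assumption (plain TSV with no quoted fields; myfile as in the question, and the 10-character display width is arbitrary):

import csv

# Split only on tabs: empty fields stay in place, so row[i] always
# belongs to column i, and spaces or ; | : inside a field are untouched.
with open("myfile", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        print("".join(f"{field:<10}" for field in row))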

Multiple level grouping in Crystal Reports

I have a report with 4 columns:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row2 A1 B1 C1 D2
Row3 A1 B1 C1 D1
Row4 A1 B1 C1 D2
Row5 A1 B1 C1 D1
I grouped based on the 4 columns, but I got output with a space after every row.
Instead, in this report I would like to get the output as:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row2 A1 B1 C1 D2
<-------------an empty space ----------->
Row3 A1 B1 C1 D1
Row4 A1 B1 C1 D2
<-------------an empty space ----------->
Row5 A1 B1 C1 D1
How can I achieve the above output?
A standard group-by would sort the records like this:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row3 A1 B1 C1 D1
Row5 A1 B1 C1 D1
Row2 A1 B1 C1 D2
Row4 A1 B1 C1 D2
Since you don't have a standard grouping, another approach may work. You basically want a blank line after the D2 value. This will only work if you always have D2 values at the end of a group.
Create a new blank detail section under the main section
Detail one
A1 B1 C1 D1
Detail two
<blank>
Then put a conditional suppress expression on detail two
ColumnD <> "D2"
Then whenever D2 is present the blank detail section will be displayed.
You can use a Formula instead of a field Value for grouping.
select {Column4}
    case "D1" : "Group1"
    case "D2" : "Group2"
    case "D3" : "Group3"
    case "D4" : "Group3"
    case "D5" : "Group3"
    case "D6" : "Group4"
    default : "Group5"
Is that your problem?
The blank lines can then be generated as a Group Footer.