Pass window partition dataset to UDF - Scala

I have a dataframe like below,
|Id1 |Id2 |Id3       |TaskId     |TaskName    |index
|1   |11  |bc123-234 |dfr3ws-45d |randomName1 |1
|1   |11  |bc123-234 |er98d3-lkj |randomName2 |2
|1   |11  |bc123-234 |hu77d9-mnb |randomName3 |3
|1   |11  |bc123-234 |xc33d5-rew |deployhere4 |4
|1   |11  |xre43-876 |dfr3ws-45d |randomName1 |1
|1   |11  |xre43-876 |er98d3-lkj |deployhere2 |2
|1   |11  |xre43-876 |hu77d9-mnb |randomName3 |3
|1   |11  |xre43-876 |xc33d5-rew |randomName4 |4
I partitioned the data by Id3 and Id2 and added the row_number as index.
I need to apply the following condition:
The task with TaskId "hu77d9-mnb" should come before the task whose name contains "deploy". As the table above suggests, the names are random, so I need to read each name in the partition and find which name contains "deploy".
If the index of the "deploy" TaskName is greater than the index of that TaskId, I mark the value as 1, otherwise 0.
I need to get a final table like this:
|Id1 |Id2 |Id3       |TaskId     |TaskName    |index |result
|1   |11  |bc123-234 |dfr3ws-45d |randomName1 |1     |1
|1   |11  |bc123-234 |er98d3-lkj |randomName2 |2     |1
|1   |11  |bc123-234 |hu77d9-mnb |randomName3 |3     |1
|1   |11  |bc123-234 |xc33d5-rew |deployhere4 |4     |1
|1   |11  |xre43-876 |dfr3ws-45d |randomName1 |1     |0
|1   |11  |xre43-876 |er98d3-lkj |deployhere2 |2     |0
|1   |11  |xre43-876 |hu77d9-mnb |randomName3 |3     |0
|1   |11  |xre43-876 |xc33d5-rew |randomName4 |4     |0
I am stuck here: how can I pass the partition data to a UDF (or other functions such as a UDAF) and perform this task? Any suggestion will be helpful. Thank you for your time.

The index of the "deploy" row and the index of the specific row ("hu77d9-mnb") can be assigned to each row with the window "first" function, and then simply compared:
// Window and column functions; toDF and $ assume spark.implicits._ is in scope
// (e.g. in spark-shell, or via `import spark.implicits._` on your SparkSession)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 11, "bc123-234", "dfr3ws-45d", "randomName1", 1),
  (1, 11, "bc123-234", "er98d3-lkj", "randomName2", 2),
  (1, 11, "bc123-234", "hu77d9-mnb", "randomName3", 3),
  (1, 11, "bc123-234", "xc33d5-rew", "deployhere4", 4),
  (1, 11, "xre43-876", "dfr3ws-45d", "randomName1", 1),
  (1, 11, "xre43-876", "er98d3-lkj", "deployhere2", 2),
  (1, 11, "xre43-876", "hu77d9-mnb", "randomName3", 3),
  (1, 11, "xre43-876", "xc33d5-rew", "randomName4", 4)
).toDF("Id1", "Id2", "Id3", "TaskID", "TaskName", "index")

val specificTaskId = "hu77d9-mnb"
val idsWindow = Window.partitionBy("Id1", "Id2", "Id3")

df.withColumn("deployIndex",
    // index of the row whose TaskName contains "deploy" (first non-null value in the partition)
    first(
      when(instr($"TaskName", "deploy") > 0, $"index").otherwise(null),
      true)
      .over(idsWindow))
  .withColumn("specificTaskIdIndex",
    // index of the row with the specific TaskID
    first(
      when($"TaskID" === lit(specificTaskId), $"index").otherwise(null),
      true)
      .over(idsWindow))
  .withColumn("result",
    when($"specificTaskIdIndex" > $"deployIndex", 0).otherwise(1)
  )
Output (the "deployIndex" and "specificTaskIdIndex" helper columns can be dropped afterwards, as shown in the snippet after the output):
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|Id1|Id2|Id3 |TaskID |TaskName |index|deployIndex|specificTaskIdIndex|result|
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|1 |11 |bc123-234|dfr3ws-45d|randomName1|1 |4 |3 |1 |
|1 |11 |bc123-234|er98d3-lkj|randomName2|2 |4 |3 |1 |
|1 |11 |bc123-234|hu77d9-mnb|randomName3|3 |4 |3 |1 |
|1 |11 |bc123-234|xc33d5-rew|deployhere4|4 |4 |3 |1 |
|1 |11 |xre43-876|dfr3ws-45d|randomName1|1 |2 |3 |0 |
|1 |11 |xre43-876|er98d3-lkj|deployhere2|2 |2 |3 |0 |
|1 |11 |xre43-876|hu77d9-mnb|randomName3|3 |2 |3 |0 |
|1 |11 |xre43-876|xc33d5-rew|randomName4|4 |2 |3 |0 |
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
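For completeness, a minimal sketch of that cleanup, assuming the chained expression above is assigned to a value (resultDf is my placeholder name, not part of the original answer):
// drop the two helper columns to obtain the final table
val finalDf = resultDf.drop("deployIndex", "specificTaskIdIndex")
finalDf.show(false)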

Related

Column transform in pyspark dataframe

I'm new to PySpark.
I want to do some column transforms.
My dataframe:
import pandas as pd
df = pd.DataFrame([[10, 8, 9], [ 3, 5, 4], [ 1, 3, 9], [ 1, 5, 3], [ 2, 8, 10], [ 8, 7, 9]],columns=list('ABC'))
df:
A B C
0 10 8 9
1 3 5 4
2 1 3 9
3 1 5 3
4 2 8 10
5 8 7 9
In df, each row is a triangle; columns 'A', 'B', 'C' are the vertex indices of the triangle.
I want to get a dataframe of all the triangles' edges.
Under these conditions:
For each edge, the lesser vertex index always comes first.
Remove duplicate edges.
Edge [8, 9] and edge [9, 8] are seen as the same edge; only [8, 9] remains (lesser vertex index first).
My desired dataframe edge_df:
1 3
1 5
1 9
2 8
2 10
3 4
3 5
3 9
4 5
7 8
7 9
8 9
8 10
9 10
I tried to join 'AB', 'AC', 'BA', 'BC', 'CA', 'CB', then distinct(), and drop() the rows where the lesser vertex index is in the right column.
Is there a more effective way?
I think explode works well in this case. orderBy is not really needed, but I added it to match the desired output:
from pyspark.sql import functions as f

# df is assumed to be a Spark DataFrame here (e.g. spark.createDataFrame(pandas_df))
df.select(f.explode(f.array(f.array_sort(f.array('A', 'B')),
                            f.array_sort(f.array('B', 'C')),
                            f.array_sort(f.array('C', 'A')))).alias('temp')) \
  .select(f.col('temp')[0].alias('a'), f.col('temp')[1].alias('b')) \
  .distinct().orderBy('a', 'b').show(truncate=False)
+---+---+
|a  |b  |
+---+---+
|1  |3  |
|1  |5  |
|1  |9  |
|2  |8  |
|2  |10 |
|3  |4  |
|3  |5  |
|3  |9  |
|4  |5  |
|7  |8  |
|7  |9  |
|8  |9  |
|8  |10 |
|9  |10 |
+---+---+

How do I add a new column to a Spark dataframe for every row that exists?

I'm trying to create a comparison matrix using a Spark dataframe, and am starting by creating a single column dataframe with one row per value:
val df = List(1, 2, 3, 4, 5).toDF
From here, what I need to do is create a new column for each row, and insert (for now), a random number in each space, like this:
Item 1 2 3 4 5
------ --- --- --- --- ---
1 0 7 3 6 2
2 1 0 4 3 1
3 8 6 0 4 4
4 8 8 1 0 9
5 9 5 3 6 0
Any assistance would be appreciated!
Consider transposing the input DataFrame df using the .pivot() function, like the following. Note that List(1, 2, 3, 4, 5).toDF names its single column "value", so build it as List(1, 2, 3, 4, 5).toDF("item") (or rename the column) for the code below to work:
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.DataTypes

val output = df.groupBy("item").pivot("item").agg((rand() * 100).cast(DataTypes.IntegerType))
This generates a new DataFrame with a random Integer value in the cell corresponding to the row's own value (null otherwise).
+----+----+----+----+----+----+
|item|1 |2 |3 |4 |5 |
+----+----+----+----+----+----+
|1 |9 |null|null|null|null|
|3 |null|null|2 |null|null|
|5 |null|null|null|null|6 |
|4 |null|null|null|26 |null|
|2 |null|33 |null|null|null|
+----+----+----+----+----+----+
If you don't want the null values, you can consider applying a UDF (or a simpler column expression) afterwards.
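For illustration, a rough sketch of that follow-up step (my own addition, not part of the original answer), using coalesce with rand() rather than a UDF, and assuming the pivoted DataFrame above is named output:
import org.apache.spark.sql.functions.{coalesce, col, rand}
import org.apache.spark.sql.types.DataTypes

// replace each null cell with a fresh random integer; the "item" column is left untouched
val filled = output.columns.filterNot(_ == "item").foldLeft(output) { (acc, c) =>
  acc.withColumn(c, coalesce(col(c), (rand() * 100).cast(DataTypes.IntegerType)))
}
filled.show(false)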

Row manipulation for Dataframe in spark [duplicate]

This question already has an answer here:
How to flatmap a nested Dataframe in Spark
(1 answer)
Closed 4 years ago.
I have a dataframe in Spark which is like:
column_A | column_B
--------- --------
1 1,12,21
2 6,9
Both column_A and column_B are of String type.
How can I convert the above dataframe to a new dataframe like this:
colum_new_A | column_new_B
----------- ------------
1 1
1 12
1 21
2 6
2 9
Both column_new_A and column_new_B should be of String type.
You need to split column_B on the comma and use the explode function, as below:
import org.apache.spark.sql.functions.{explode, split}
// toDF and $ assume spark.implicits._ is in scope

val df = Seq(
  ("1", "1,12,21"),
  ("2", "6,9")
).toDF("column_A", "column_B")
You can use withColumn or select to create the new column:
df.withColumn("column_B", explode(split($"column_B", ","))).show(false)
df.select($"column_A".as("column_new_A"), explode(split($"column_B", ",")).as("column_new_B")).show(false)
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+
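A small addition (not in the original answer): since split produces an array<string>, both resulting columns stay of String type, which you can confirm with printSchema:
// quick schema check on the select variant shown above
df.select($"column_A".as("column_new_A"), explode(split($"column_B", ",")).as("column_new_B")).printSchema()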

need help to compare two columns in spark scala

I have a Spark dataframe like this:
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes, No
2 1 test1 No Yes, No
3 2 test2 value1 val1, Value1,value2
4 1 test3 3 0, 1, 2
5 3 test4 0 0, 1, 2
11 2 test Yes Yes, No
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
55 3 test4 0 0, 1, 2
val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable")
I want to check whether attr_value is contained in attr_valuelist; if it is not, then keep only those rows:
id1 id2 attrname attr_value attr_valuelist
4 1 test3 3 0, 1, 2
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
You can simply do the following with contains on your dataframe:
import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)
You should have the following output:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|3 |2 |test2 |value1 |val1, Value1,value2|
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
If you want to ignore the case of the letters, you can simply use the lower function:
df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)
You should have:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
You can define a custom function, a user-defined function (UDF) in Spark, in which you test whether the value from one column is contained in the value of the other column, like this:
import org.apache.spark.sql.functions.udf

def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))
You can tweak the contains function as you like, and then filter your dataframe like this:
df.where(contains(df("attr_value"), df("attr_valuelist")))
df.where(notContains(df("attr_value"), df("attr_valuelist")))
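For instance, a case-insensitive variant of the UDF, mirroring the lower-based filter shown earlier (my own sketch; the name notContainsIgnoreCase is not from the original answer):
import org.apache.spark.sql.functions.udf

def notContainsIgnoreCase = udf((attr: String, attrList: String) =>
  !attrList.toLowerCase.contains(attr.toLowerCase))

// keep only the rows whose attr_value does not appear in attr_valuelist, ignoring case
df.where(notContainsIgnoreCase(df("attr_value"), df("attr_valuelist"))).show(false)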

Clean dirty data

I have three variables (ID, Name and City) and need to generate a new variable, flag.
Something is wrong with some of the observations. I need to find the wrong observations and create the flag. The variable flag indicates which column contains the wrong observation.
Suppose there is at most one bad observation in each row.
Given the dirty data:
|ID |Name |City
|1 |IBM |D
|1 |IBM |D
|2 |IBM |D
|3 |Google |F
|3 |Microsoft |F
|3 |Google |F
|8 |Microsoft |A
|8 |Microsoft |B
|8 |Microsoft |A
Result
|ID |Name |City |flag
|1 |IBM |D |0
|1 |IBM |D |0
|2 |IBM |D |1
|3 |Google |F |0
|3 |Microsoft |F |2
|3 |Google |F |0
|8 |Microsoft |A |0
|8 |Microsoft |B |3
|8 |Microsoft |A |0
Here is an answer in Stata that rests on many assumptions that you pointed out in the comments but not in the initial question:
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
// get a dummy variable that tags duplicated rows (duplicates are assumed to be correct)
duplicates tag, gen(right)
gen flag = 0
encode Name, gen(Name_n)
encode City, gen(City_n)
qui sum
forvalues start = 1(3)`r(N)' {
    local end = `start'+2
    // check if ID is all the same within the group
    qui sum ID in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 1 in `start'/`end' if right == 0
        continue
    }
    // check if Name is all the same within the group
    qui sum Name_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 2 in `start'/`end' if right == 0
        continue
    }
    // check if City is all the same within the group
    qui sum City_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 3 in `start'/`end' if right == 0
        continue
    }
}
drop right Name_n City_n
The intuition: because the observations come in groups of three, two of the three are always right, there is at most one issue per group, and they are sorted by ID (which can be wrong, but never greater than the next correct ID). So we can first check for duplicates; if an observation is duplicated, that observation is right.
Next, in the forvalues loop, we go through each group of three to see which of the variables has the wrong value; when we find it, we replace flag with the appropriate number.
This code is based on Eric's answer.
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
encode Name, gen(Name_n)
encode City, gen(City_n)
// tag duplicates on each pair of columns and on all three columns
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)
// generate the flag
gen flag = 0
replace flag = 1 if col_123 == 0 & col_23 ~= 0
replace flag = 2 if col_123 == 0 & col_13 ~= 0
replace flag = 3 if col_123 == 0 & col_12 ~= 0
drop Name_n City_n col_*