I have a Spark dataframe like this:
id1  id2  attrname  attr_value  attr_valuelist
1    2    test      Yes         Yes, No
2    1    test1     No          Yes, No
3    2    test2     value1      val1, Value1,value2
4    1    test3     3           0, 1, 2
5    3    test4     0           0, 1, 2
11   2    test      Yes         Yes, No
22   1    test1     No1         Yes, No
33   2    test2     value0      val1, Value1,value2
44   1    test3     11          0, 1, 2
55   3    test4     0           0, 1, 2
val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable")
I want to check whether attr_value exists in attr_valuelist, and keep only the rows where it does not:
id1  id2  attrname  attr_value  attr_valuelist
4    1    test3     3           0, 1, 2
22   1    test1     No1         Yes, No
33   2    test2     value0      val1, Value1,value2
44   1    test3     11          0, 1, 2
You can simply do the following with contains on your dataframe:
import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)
You should have the following output:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|3 |2 |test2 |value1 |val1, Value1,value2|
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
If you want to ignore case, you can simply use the lower function as
df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)
You should have:
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
You can define a custom function (a user-defined function, or UDF, in Spark) that tests whether the value of one column is contained in the value of the other column, like this:
def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))
You can tweak the contains function however you want, and then select from your dataframe like this:
df.where(contains(df("attr_value"), df("attr_valuelist")))
df.where(notContains(df("attr_value"), df("attr_valuelist")))
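If you also want the UDF route to ignore case (mirroring the lower-based filter in the previous answer), here is a minimal sketch; the notContainsIgnoreCase name is introduced here for illustration and is not part of the original answer:

import org.apache.spark.sql.functions.udf

// case-insensitive variant: lower-case both sides before checking containment
def notContainsIgnoreCase = udf((attr: String, attrList: String) =>
  !attrList.toLowerCase.contains(attr.toLowerCase))

df.where(notContainsIgnoreCase(df("attr_value"), df("attr_valuelist"))).show(false)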
I'm new to PySpark.
I want to do some column transforms.
My dataframe:
import pandas as pd
df = pd.DataFrame([[10, 8, 9], [ 3, 5, 4], [ 1, 3, 9], [ 1, 5, 3], [ 2, 8, 10], [ 8, 7, 9]],columns=list('ABC'))
df:
A B C
0 10 8 9
1 3 5 4
2 1 3 9
3 1 5 3
4 2 8 10
5 8 7 9
In df, each row is a triangle, and columns 'A', 'B', 'C' are the vertex indices of the triangle.
I want to get the dataframe of all the triangles' edges.
Under conditions:
For each edge, always lesser vertex index first.
Remove duplicate edges.
Edge [8, 9] and edge [9, 8] are treated as the same edge; only [8, 9] remains (always lesser vertex index first).
My desired dataframe edge_df:
1 3
1 5
1 9
2 8
2 10
3 4
3 5
3 9
4 5
7 8
7 9
8 9
8 10
9 10
I tried joining 'AB', 'AC', 'BA', 'BC', 'CA', 'CB', then distinct(), then drop() of the rows where the lesser vertex index ends up in the right column.
Is there a more effective way?
I think explode works well in this case. orderBy is not really needed (it costs a sort), but I added it to match the desired output:
from pyspark.sql import functions as f
# this assumes df is a Spark DataFrame (e.g. df = spark.createDataFrame(pandas_df))
df.select(f.explode(f.array(f.array_sort(f.array('A', 'B')), f.array_sort(f.array('B', 'C')), f.array_sort(f.array('C', 'A')))).alias('temp')) \
.select(f.col('temp')[0].alias('a'), f.col('temp')[1].alias('b')).distinct().orderBy('a', 'b') \
.show(truncate=False)
+---+---+
|a |b |
+---+---+
|1 |3 |
|1 |5 |
|1 |9 |
|2 |8 |
|2 |10 |
|3  |4  |
|3  |5  |
|3  |9  |
|4  |5  |
|7 |8 |
|7 |9 |
|8 |9 |
|8 |10 |
|9 |10 |
+---+---+
I have a dataframe like the one below:
Id1  Id2  Id3        TaskId      TaskName     index
1    11   bc123-234  dfr3ws-45d  randomName1  1
1    11   bc123-234  er98d3-lkj  randomName2  2
1    11   bc123-234  hu77d9-mnb  randomName3  3
1    11   bc123-234  xc33d5-rew  deployhere4  4
1    11   xre43-876  dfr3ws-45d  randomName1  1
1    11   xre43-876  er98d3-lkj  deployhere2  2
1    11   xre43-876  hu77d9-mnb  randomName3  3
1    11   xre43-876  xc33d5-rew  randomName4  4
I partitioned the data using Id3 and Id2 and added the row_number.
I need to implement the condition below:
The TaskId "hu77d9-mnb" should come before the TaskName that contains "deploy" in it. As the table above suggests, the names are random, so I need to read each name in the partition and find which name contains "deploy".
If the index of the "deploy" TaskName is greater than the index of that TaskId, I mark the value as 1, otherwise 0.
I need to get a final table like this:
Id1  Id2  Id3        TaskId      TaskName     index  result
1    11   bc123-234  dfr3ws-45d  randomName1  1      1
1    11   bc123-234  er98d3-lkj  randomName2  2      1
1    11   bc123-234  hu77d9-mnb  randomName3  3      1
1    11   bc123-234  xc33d5-rew  deployhere4  4      1
1    11   xre43-876  dfr3ws-45d  randomName1  1      0
1    11   xre43-876  er98d3-lkj  deployhere2  2      0
1    11   xre43-876  hu77d9-mnb  randomName3  3      0
1    11   xre43-876  xc33d5-rew  randomName4  4      0
I am stuck at this point: how can I pass the partitioned data to a UDF (or to other functions like a UDAF) and perform this task? Any suggestion will be helpful. Thank you for your time.
The index of the "deploy" row and the index of the specific row ("hu77d9-mnb") can be assigned to each row with the Window "first" function, and then simply compared:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming "spark" is your SparkSession

val df = Seq(
(1, 11, "bc123-234", "dfr3ws-45d", "randomName1", 1),
(1, 11, "bc123-234", "er98d3-lkj", "randomName2", 2),
(1, 11, "bc123-234", "hu77d9-mnb", "randomName3", 3),
(1, 11, "bc123-234", "xc33d5-rew", "deployhere4", 4),
(1, 11, "xre43-876", "dfr3ws-45d", "randomName1", 1),
(1, 11, "xre43-876", "er98d3-lkj", "deployhere2", 2),
(1, 11, "xre43-876", "hu77d9-mnb", "randomName3", 3),
(1, 11, "xre43-876", "xc33d5-rew", "randomName4", 4)
).toDF("Id1", "Id2", "Id3", "TaskID", "TaskName", "index")
val specificTaskId = "hu77d9-mnb"
val idsWindow = Window.partitionBy("Id1", "Id2", "Id3")
df.withColumn("deployIndex",
    first(
      when(instr($"TaskName", "deploy") > 0, $"index").otherwise(null),
      true)
      .over(idsWindow))
  .withColumn("specificTaskIdIndex",
    first(
      when($"TaskID" === lit(specificTaskId), $"index").otherwise(null),
      true)
      .over(idsWindow))
  .withColumn("result",
    when($"specificTaskIdIndex" > $"deployIndex", 0).otherwise(1)
  )
Output (the "deployIndex" and "specificTaskIdIndex" helper columns have to be dropped afterwards; see the snippet after the table):
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|Id1|Id2|Id3 |TaskID |TaskName |index|deployIndex|specificTaskIdIndex|result|
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|1 |11 |bc123-234|dfr3ws-45d|randomName1|1 |4 |3 |1 |
|1 |11 |bc123-234|er98d3-lkj|randomName2|2 |4 |3 |1 |
|1 |11 |bc123-234|hu77d9-mnb|randomName3|3 |4 |3 |1 |
|1 |11 |bc123-234|xc33d5-rew|deployhere4|4 |4 |3 |1 |
|1 |11 |xre43-876|dfr3ws-45d|randomName1|1 |2 |3 |0 |
|1 |11 |xre43-876|er98d3-lkj|deployhere2|2 |2 |3 |0 |
|1 |11 |xre43-876|hu77d9-mnb|randomName3|3 |2 |3 |0 |
|1 |11 |xre43-876|xc33d5-rew|randomName4|4 |2 |3 |0 |
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
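As noted above, the two helper columns should be dropped at the end. A minimal sketch, assuming the chained expression above was assigned to a val (the name annotated is introduced here purely for illustration):

// keep only the original columns plus "result"
val finalDf = annotated.drop("deployIndex", "specificTaskIdIndex")
finalDf.show(false)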
My spark dataframe looks like this:
+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23 |null |dsad |3 |
|11 |44 |null |4 |
|231 |null |temp |5 |
|231 |null |temp |2 |
+------+------+-------+------+
I want to do the calculation for each pair of userid and useid1/userid2 (whichever is not null).
If it's useid1, I multiply the score by 5; if it's userid2, I multiply the score by 3.
Finally, I want to sum all the scores for each pair.
The result should be:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|11 |44 |20 |
|231 |temp |21 |
+------+--------+-----------+
How can I do this?
For the groupBy part, I know the dataframe has a groupBy function, but I don't know if I can use it conditionally: if useid1 is null, groupBy(userid, userid2); if userid2 is null, groupBy(userid, useid1).
For the calculation part, how do I multiply by 3 or 5 based on the condition?
The solution below will help solve your problem.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming "spark" is your SparkSession

val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")
val finalScoreDF = userDF.withColumn("useid1/2", when($"useid1".isNull, $"userid2").otherwise($"useid1"))
  .withColumn("finalscore", when($"useid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore").distinct()
Using the when method in Spark SQL, we select useid1 or userid2 and multiply the score based on the condition.
Output:
+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
| 11 | 44| 20.0|
| 23 | dsad| 9.0|
| 231| temp| 21.0|
+------+--------+----------+
Group by will work:
val original = Seq(
(23, null, "dsad", 3),
(11, "44", null, 4),
(231, null, "temp", 5),
(231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")
// action
val result = original
.withColumn("useid1/2", coalesce($"useid1", $"userid2"))
.withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
.groupBy("userid", "useid1/2")
.agg(sum("score").alias("final score"))
result.show(false)
Output:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|231 |temp |21 |
|11 |44 |20 |
+------+--------+-----------+
coalesce will do what you need here.
df.withColumn("userid1/2", coalesce(col("useid1"), col("userid2")))
Basically, this function returns the first non-null value in the given order.
Documentation:
COALESCE(T v1, T v2, ...)
Returns the first v that is not NULL, or NULL if all v's are NULL.
It needs an import: import org.apache.spark.sql.functions.coalesce
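As a small self-contained illustration of the behaviour described above (the column names c1 and c2 are made up for this example and are not from the question):

import org.apache.spark.sql.functions.coalesce
import spark.implicits._ // assuming "spark" is your SparkSession

// coalesce picks, per row, the first non-null value from left to right
val demo = Seq((Some("a"), None: Option[String]), (None: Option[String], Some("b"))).toDF("c1", "c2")
demo.select(coalesce($"c1", $"c2").alias("first_non_null")).show()
// expected: "a" for the first row, "b" for the second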
I'm trying to create a comparison matrix using a Spark dataframe, and am starting by creating a single column dataframe with one row per value:
val df = List(1, 2, 3, 4, 5).toDF
From here, what I need to do is create a new column for each row, and insert (for now), a random number in each space, like this:
Item 1 2 3 4 5
------ --- --- --- --- ---
1 0 7 3 6 2
2 1 0 4 3 1
3 8 6 0 4 4
4 8 8 1 0 9
5 9 5 3 6 0
Any assistance would be appreciated!
Consider pivoting the input DataFrame df using the .pivot() function, like the following. This assumes the single column is named item (e.g. val df = List(1, 2, 3, 4, 5).toDF("item")) and that org.apache.spark.sql.functions._ and org.apache.spark.sql.types.DataTypes are imported:
val output = df.groupBy("item").pivot("item").agg((rand() * 100).cast(DataTypes.IntegerType))
This will generate a new DataFrame with a random Integer value in the column corresponding to each row's item value (and null elsewhere):
+----+----+----+----+----+----+
|item|1 |2 |3 |4 |5 |
+----+----+----+----+----+----+
|1 |9 |null|null|null|null|
|3 |null|null|2 |null|null|
|5 |null|null|null|null|6 |
|4 |null|null|null|26 |null|
|2 |null|33 |null|null|null|
+----+----+----+----+----+----+
If you don't want the null values, you can replace them afterwards, for example with a UDF.
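For example, here is a minimal sketch that replaces the nulls with 0 using na.fill instead of a UDF, assuming 0 is an acceptable placeholder value in your matrix:

// replace every null in the pivoted matrix with 0
val filled = output.na.fill(0)
filled.show(false)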
This question already has an answer here: How to flatmap a nested Dataframe in Spark (1 answer). Closed 4 years ago.
I have a dataframe in Spark which is like:
column_A | column_B
--------- --------
1 1,12,21
2 6,9
Both column_A and column_B are of String type.
How can I convert the above dataframe to a new dataframe which is like:
colum_new_A | column_new_B
----------- ------------
1 1
1 12
1 21
2 6
2 9
Both column_new_A and column_new_B should be of String type.
You need to split column_B on the comma and use the explode function, as shown below.
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._ // assuming "spark" is your SparkSession

val df = Seq(
("1", "1,12,21"),
("2", "6,9")
).toDF("column_A", "column_B")
You can use withColumn or select to create the new column.
df.withColumn("column_B", explode(split( $"column_B", ","))).show(false)
df.select($"column_A".as("column_new_A"), explode(split( $"column_B", ",")).as("column_new_B")).show(false)
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+