Column transform in pyspark dataframe - pyspark

I'm new to PySpark.
I want to do some column transformations.
My dataframe:
import pandas as pd
df = pd.DataFrame([[10, 8, 9], [ 3, 5, 4], [ 1, 3, 9], [ 1, 5, 3], [ 2, 8, 10], [ 8, 7, 9]],columns=list('ABC'))
df:
A B C
0 10 8 9
1 3 5 4
2 1 3 9
3 1 5 3
4 2 8 10
5 8 7 9
In df, each row is a triangle, and columns A, B and C hold the vertex indices of that triangle.
I want to get a dataframe of all the triangles' edges.
Conditions:
For each edge, the lesser vertex index always comes first.
Duplicate edges are removed.
Edge [8, 9] and edge [9, 8] count as the same edge, so only [8, 9] is kept (lesser vertex index first).
My desired dataframe edge_df:
1 3
1 5
1 9
2 8
2 10
3 4
3 5
3 9
4 5
7 8
7 9
8 9
8 10
9 10
I tried joining 'AB', 'AC', 'BA', 'BC', 'CA', 'CB', then calling distinct() and dropping the rows where the lesser vertex index is in the right column.
Is there a more efficient way?

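I think explode is a good fit here. The orderBy is not needed for the result itself (it adds an extra sort); I added it only so the output matches the desired order. This assumes df is a Spark DataFrame (for example, spark.createDataFrame(df) on the pandas dataframe above) and Spark 2.4+ for array_sort.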
from pyspark.sql import functions as f
df.select(f.explode(f.array(f.array_sort(f.array('A', 'B')), f.array_sort(f.array('B', 'C')), f.array_sort(f.array('C', 'A')))).alias('temp')) \
.select(f.col('temp')[0].alias('a'), f.col('temp')[1].alias('b')).distinct().orderBy('a', 'b') \
.show(truncate=False)
+---+---+
|a  |b  |
+---+---+
|1  |3  |
|1  |5  |
|1  |9  |
|2  |8  |
|2  |10 |
|3  |4  |
|3  |5  |
|3  |9  |
|4  |5  |
|7  |8  |
|7  |9  |
|8  |9  |
|8  |10 |
|9  |10 |
+---+---+
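To sanity-check the Spark result, the same rule (sort each pair, then de-duplicate) can be reproduced in plain Python. This is only a minimal sketch without Spark; the helper name unique_edges is made up for illustration.

```python
from itertools import combinations

# Sample triangles from the question: each row holds three vertex indices.
triangles = [[10, 8, 9], [3, 5, 4], [1, 3, 9], [1, 5, 3], [2, 8, 10], [8, 7, 9]]

def unique_edges(rows):
    """Return the sorted, de-duplicated edges, lesser vertex index first."""
    edges = set()
    for row in rows:
        # Sorting the row first means every pair comes out as (lesser, greater).
        for lo, hi in combinations(sorted(row), 2):
            edges.add((lo, hi))
    return sorted(edges)

edges = unique_edges(triangles)
for a, b in edges:
    print(a, b)
```

For the sample data this yields 14 edges, including 3 4 and 4 5 from the triangle [3, 5, 4], which matches the desired edge_df in the question.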

Related

Scala Pass window partition dataset to UDF

I have a dataframe like below,
Id1  Id2  Id3        TaskId      TaskName     index
1    11   bc123-234  dfr3ws-45d  randomName1  1
1    11   bc123-234  er98d3-lkj  randomName2  2
1    11   bc123-234  hu77d9-mnb  randomName3  3
1    11   bc123-234  xc33d5-rew  deployhere4  4
1    11   xre43-876  dfr3ws-45d  randomName1  1
1    11   xre43-876  er98d3-lkj  deployhere2  2
1    11   xre43-876  hu77d9-mnb  randomName3  3
1    11   xre43-876  xc33d5-rew  randomName4  4
I partitioned the data by Id3 and Id2 and added the row_number as index.
I need to check the following condition:
TaskId "hu77d9-mnb" should come before the task whose name contains "deploy". As the table above suggests, the names are random, so I need to read each name in the partition and see which one contains "deploy".
If the index of the deploy task is greater than the index of that TaskId, I mark the value as 1, otherwise 0.
I need to get a final table like this:
Id1  Id2  Id3        TaskId      TaskName     index  result
1    11   bc123-234  dfr3ws-45d  randomName1  1      1
1    11   bc123-234  er98d3-lkj  randomName2  2      1
1    11   bc123-234  hu77d9-mnb  randomName3  3      1
1    11   bc123-234  xc33d5-rew  deployhere4  4      1
1    11   xre43-876  dfr3ws-45d  randomName1  1      0
1    11   xre43-876  er98d3-lkj  deployhere2  2      0
1    11   xre43-876  hu77d9-mnb  randomName3  3      0
1    11   xre43-876  xc33d5-rew  randomName4  4      0
I am stuck here: how can I pass the partition data to a UDF (or another function such as a UDAF) to perform this task? Any suggestion will be helpful. Thank you for your time.
No UDF is needed. The index of the "deploy" row and the index of the specific row ("hu77d9-mnb") can be assigned to every row of the partition with the window function "first" (ignoring nulls), and then simply compared:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark; needed for toDF and $"..."

val df = Seq(
  (1, 11, "bc123-234", "dfr3ws-45d", "randomName1", 1),
  (1, 11, "bc123-234", "er98d3-lkj", "randomName2", 2),
  (1, 11, "bc123-234", "hu77d9-mnb", "randomName3", 3),
  (1, 11, "bc123-234", "xc33d5-rew", "deployhere4", 4),
  (1, 11, "xre43-876", "dfr3ws-45d", "randomName1", 1),
  (1, 11, "xre43-876", "er98d3-lkj", "deployhere2", 2),
  (1, 11, "xre43-876", "hu77d9-mnb", "randomName3", 3),
  (1, 11, "xre43-876", "xc33d5-rew", "randomName4", 4)
).toDF("Id1", "Id2", "Id3", "TaskID", "TaskName", "index")
val specificTaskId = "hu77d9-mnb"
val idsWindow = Window.partitionBy("Id1", "Id2", "Id3")
df
  // index of the row whose TaskName contains "deploy", copied to every row of the partition
  .withColumn("deployIndex",
    first(
      when(instr($"TaskName", "deploy") > 0, $"index").otherwise(null),
      true) // true = ignore nulls
      .over(idsWindow))
  // index of the row with the specific TaskID, copied to every row of the partition
  .withColumn("specificTaskIdIndex",
    first(
      when($"TaskID" === lit(specificTaskId), $"index").otherwise(null),
      true)
      .over(idsWindow))
  // 0 when the specific task comes after the deploy task, 1 otherwise
  .withColumn("result",
    when($"specificTaskIdIndex" > $"deployIndex", 0).otherwise(1)
  )
Output ("deployIndex" and "specificTaskIdIndex" columns have to be dropped):
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|Id1|Id2|Id3 |TaskID |TaskName |index|deployIndex|specificTaskIdIndex|result|
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
|1 |11 |bc123-234|dfr3ws-45d|randomName1|1 |4 |3 |1 |
|1 |11 |bc123-234|er98d3-lkj|randomName2|2 |4 |3 |1 |
|1 |11 |bc123-234|hu77d9-mnb|randomName3|3 |4 |3 |1 |
|1 |11 |bc123-234|xc33d5-rew|deployhere4|4 |4 |3 |1 |
|1 |11 |xre43-876|dfr3ws-45d|randomName1|1 |2 |3 |0 |
|1 |11 |xre43-876|er98d3-lkj|deployhere2|2 |2 |3 |0 |
|1 |11 |xre43-876|hu77d9-mnb|randomName3|3 |2 |3 |0 |
|1 |11 |xre43-876|xc33d5-rew|randomName4|4 |2 |3 |0 |
+---+---+---------+----------+-----------+-----+-----------+-------------------+------+
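To make the per-partition rule explicit, here is a plain-Python reference of the same logic (no Spark involved). It is only a sketch for checking expected results; mark_partitions and SPECIFIC_TASK_ID are names invented for this example.

```python
from itertools import groupby
from operator import itemgetter

# Same sample rows as above: (Id1, Id2, Id3, TaskID, TaskName, index)
rows = [
    (1, 11, "bc123-234", "dfr3ws-45d", "randomName1", 1),
    (1, 11, "bc123-234", "er98d3-lkj", "randomName2", 2),
    (1, 11, "bc123-234", "hu77d9-mnb", "randomName3", 3),
    (1, 11, "bc123-234", "xc33d5-rew", "deployhere4", 4),
    (1, 11, "xre43-876", "dfr3ws-45d", "randomName1", 1),
    (1, 11, "xre43-876", "er98d3-lkj", "deployhere2", 2),
    (1, 11, "xre43-876", "hu77d9-mnb", "randomName3", 3),
    (1, 11, "xre43-876", "xc33d5-rew", "randomName4", 4),
]

SPECIFIC_TASK_ID = "hu77d9-mnb"  # hypothetical constant name

def mark_partitions(rows, task_id=SPECIFIC_TASK_ID):
    """Append a result flag to each row, computed per (Id1, Id2, Id3) group."""
    key = itemgetter(0, 1, 2)
    marked = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        # First index whose TaskName contains "deploy" (None if there is none).
        deploy_index = next((r[5] for r in group if "deploy" in r[4]), None)
        # Index of the row carrying the specific TaskID (None if absent).
        task_index = next((r[5] for r in group if r[3] == task_id), None)
        # Mirrors when(specific > deploy, 0).otherwise(1): missing values give 1.
        late = (deploy_index is not None and task_index is not None
                and task_index > deploy_index)
        marked.extend(r + (0 if late else 1,) for r in group)
    return marked

marked = mark_partitions(rows)
for row in marked:
    print(row)
```

The last element of each printed tuple corresponds to the result column: 1 for the bc123-234 group and 0 for the xre43-876 group.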

How do I add a new column to a Spark dataframe for every row that exists?

I'm trying to create a comparison matrix using a Spark dataframe, and am starting by creating a single column dataframe with one row per value:
val df = List(1, 2, 3, 4, 5).toDF
From here, what I need to do is create a new column for each row, and insert (for now), a random number in each space, like this:
Item 1 2 3 4 5
------ --- --- --- --- ---
1 0 7 3 6 2
2 1 0 4 3 1
3 8 6 0 4 4
4 8 8 1 0 9
5 9 5 3 6 0
Any assistance would be appreciated!
Consider pivoting the input DataFrame df with the pivot() function, as follows (this assumes the column is named item, e.g. created with toDF("item")):
val output = df.groupBy("item").pivot("item").agg((rand()*100).cast(DataTypes.IntegerType))
This generates a new DataFrame with a random integer where the column corresponds to the row value, and null elsewhere.
+----+----+----+----+----+----+
|item|1 |2 |3 |4 |5 |
+----+----+----+----+----+----+
|1 |9 |null|null|null|null|
|3 |null|null|2 |null|null|
|5 |null|null|null|null|6 |
|4 |null|null|null|26 |null|
|2 |null|33 |null|null|null|
+----+----+----+----+----+----+
If you don't want the null values, you can consider applying a UDF afterwards.
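The pivot above only fills the diagonal. One way to think about the full matrix is as a cross product of the items with themselves, followed by a pivot. The plain-Python sketch below illustrates that idea without Spark; the function name comparison_matrix, the seed parameter, the 1 to 9 value range and the zero diagonal (taken from the example in the question) are assumptions made for illustration. In Spark this would roughly correspond to a crossJoin followed by groupBy/pivot, which is not shown here.

```python
import random
from itertools import product

def comparison_matrix(items, seed=None):
    """Build an items x items matrix: 0 on the diagonal, random 1..9 elsewhere."""
    rng = random.Random(seed)
    # Cross product: one cell per ordered (row, column) pair of items.
    cells = {
        (r, c): 0 if r == c else rng.randint(1, 9)
        for r, c in product(items, repeat=2)
    }
    # "Pivot": one output row per item, one column per item.
    return {r: [cells[(r, c)] for c in items] for r in items}

items = [1, 2, 3, 4, 5]
matrix = comparison_matrix(items, seed=0)
print("Item", *items)
for item, values in matrix.items():
    print(item, *values)
```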

Row manipulation for Dataframe in spark [duplicate]

I have a dataframe in spark which is like :
column_A | column_B
--------- --------
1 1,12,21
2 6,9
Both column_A and column_B are of String type.
How can I convert the above dataframe to a new dataframe like this:
column_new_A | column_new_B
----------- ------------
1 1
1 12
1 21
2 6
2 9
both column_new_A and column_new_B should be of String type.
You need to split column_B on the comma and then use the explode function. Given:
val df = Seq(
("1", "1,12,21"),
("2", "6,9")
).toDF("column_A", "column_B")
You can use withColumn or select to create new column.
df.withColumn("column_B", explode(split( $"column_B", ","))).show(false)
df.select($"column_A".as("column_new_A"), explode(split( $"column_B", ",")).as("column_new_B"))
Output:
+------------+------------+
|column_new_A|column_new_B|
+------------+------------+
|1 |1 |
|1 |12 |
|1 |21 |
|2 |6 |
|2 |9 |
+------------+------------+

Spark scala join RDD between 2 datasets

Suppose I have two datasets as follows:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I want to join the two datasets and count how many times each person's name from dataset 1 appears in the messages of dataset 2, listing the top names. The result should be:
Bill, 4
Ramond, 2
..
..
I managed to join the two datasets, but I am not sure how to count how many times each person appears.
Any suggestion would be appreciated.
Edited:
my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
println(f._1+" "+f._2._1+" "+f._2._2)
})
Achieving this with RDDs would be complex; with dataframes it is much simpler.
The first step is to read the two files into dataframes:
val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset2")
so that you have
df1
+---+------+-----+
|id |name |score|
+---+------+-----+
|1 |Bill |200 |
|2 |Bew |23 |
|3 |Amy |44 |
|4 |Ramond|68 |
+---+------+-----+
df2
+---+-------------------+
|id |message |
+---+-------------------+
|1 |i love Bill |
|2 |i hate Bill |
|3 |Bew go go ! |
|4 |Amy is the best |
|5 |Ramond is the wrost|
|6 |Bill go go |
|7 |Bill i love ya |
|8 |Ramond is Bad |
|9 |Amy is great |
+---+-------------------+
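A join followed by groupBy and count gives the desired output. Note that with a "left" join, a name that matches no message still produces one row and is therefore counted as 1; use an inner join if such names should be left out.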
df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().as("count").show(false)
Final output would be
+------+-----+
|name |count|
+------+-----+
|Ramond|2 |
|Bill |4 |
|Amy |2 |
|Bew |1 |
+------+-----+
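The expected counts can be cross-checked without Spark. The plain-Python sketch below applies the same test as message.contains(name), a case-sensitive substring match, to the sample data; count_mentions is a name made up for this example.

```python
from collections import Counter

# Sample data from the question (ids omitted, they are not needed for counting).
names = ["Bill", "Bew", "Amy", "Ramond"]
messages = [
    "i love Bill",
    "i hate Bill",
    "Bew go go !",
    "Amy is the best",
    "Ramond is the wrost",
    "Bill go go",
    "Bill i love ya",
    "Ramond is Bad",
    "Amy is great",
]

def count_mentions(names, messages):
    """Count, per name, the messages that contain it as a substring."""
    counts = Counter()
    for name in names:
        counts[name] = sum(name in message for message in messages)
    return counts

counts = count_mentions(names, messages)
# most_common() lists the most frequently mentioned names first.
for name, n in counts.most_common():
    print(name, n)
```

Unlike the left join with count(), this sketch reports 0 for a name that never appears in any message.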

need help to compare two columns in spark scala

I have spark dataframe like this
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes, No
2 1 test1 No Yes, No
3 2 test2 value1 val1, Value1,value2
4 1 test3 3 0, 1, 2
5 3 test4 0 0, 1, 2
11 2 test Yes Yes, No
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
55 3 test4 0 0, 1, 2
val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable")
I want to check whether attr_value exists in attr_valuelist and keep only the rows where it does not:
id1 id2 attrname attr_value attr_valuelist
4 1 test3 3 0, 1, 2
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
You can do this with contains on your dataframe:
import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)
you should have following output
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|3 |2 |test2 |value1 |val1, Value1,value2|
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
If you want to ignore case, you can use the lower function:
df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)
you should have
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
You can also define a user-defined function (UDF) in Spark that tests whether the value of one column is contained in the value of the other column, like this:
def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))
You can tweak the contains function however you want, and then filter your dataframe like this:
df.where(contains(df("attr_value"), df("attr_valuelist")))
df.where(notContains(df("attr_value"), df("attr_valuelist")))
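One thing to watch with both contains and the UDFs above is that they do substring matching, so a value such as "1" would be found inside a list like "11, 2". If exact matching against the comma-separated entries is wanted, the check could look like the plain-Python sketch below (no Spark; value_in_list and its ignore_case flag are hypothetical names, and the same logic could be wrapped in a UDF).

```python
def value_in_list(attr_value, attr_valuelist, ignore_case=False):
    """True if attr_value equals one of the comma-separated entries."""
    # Split on commas and trim the spaces around each entry.
    tokens = [token.strip() for token in attr_valuelist.split(",")]
    value = attr_value.strip()
    if ignore_case:
        value = value.lower()
        tokens = [token.lower() for token in tokens]
    return value in tokens

# (id1, attr_value, attr_valuelist) taken from the sample table
rows = [
    (1, "Yes", "Yes, No"),
    (2, "No", "Yes, No"),
    (3, "value1", "val1, Value1,value2"),
    (4, "3", "0, 1, 2"),
    (5, "0", "0, 1, 2"),
    (11, "Yes", "Yes, No"),
    (22, "No1", "Yes, No"),
    (33, "value0", "val1, Value1,value2"),
    (44, "11", "0, 1, 2"),
    (55, "0", "0, 1, 2"),
]

missing = [r[0] for r in rows if not value_in_list(r[1], r[2])]
missing_ci = [r[0] for r in rows if not value_in_list(r[1], r[2], ignore_case=True)]
print(missing)
print(missing_ci)
```

With ignore_case=True the remaining ids are 4, 22, 33 and 44, which are the rows requested in the question.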