Spark: how to make value of new column based on different columns - pyspark

Spark 2.2.1
Pyspark
df = sqlContext.createDataFrame([
("dog", "1", "2", "3"),
("cat", "4", "5", "6"),
("dog", "7", "8", "9"),
("cat", "10", "11", "12"),
("dog", "13", "14", "15"),
("parrot", "16", "17", "18"),
("goldfish", "19", "20", "21"),
], ["pet", "dog_30", "cat_30", "parrot_30"])
And then I have a list of the values I care about from the "pet" column:
dfvalues = ["dog", "cat", "parrot"]
I want to write code that will give me the value from dog_30, cat_30 or parrot_30 that corresponds to the value in "pet". For example, in the first row the value of the pet column is dog, so we take the value of dog_30, which is 1.
I tried the following to compute the column, but it just gives me nulls. I also haven't figured out how to handle the goldfish case; I want to set that to 0.
mycols = [F.when(F.col("pet") == p + "_30", p) for p in dfvalues]
df = df.withColumn("newCol2", F.coalesce(*mycols))
df.show()
Desired output:
+--------+------+------+---------+------+
| pet|dog_30|cat_30|parrot_30|stats |
+--------+------+------+---------+------+
| dog| 1| 2| 3| 1 |
| cat| 4| 5| 6| 5 |
| dog| 7| 8| 9| 7 |
| cat| 10| 11| 12| 11 |
| dog| 13| 14| 15| 13 |
| parrot| 16| 17| 18| 18 |
|goldfish| 19| 20| 21| 0 |
+--------+------+------+---------+------+

The logic is off; you need .when(F.col("pet") == p, F.col(p + '_30')):
from pyspark.sql import functions as F

mycols = [F.when(F.col("pet") == p, F.col(p + '_30')) for p in dfvalues]
df = df.withColumn("newCol2", F.coalesce(F.coalesce(*mycols), F.lit(0)))
df.show()
+--------+------+------+---------+-------+
| pet|dog_30|cat_30|parrot_30|newCol2|
+--------+------+------+---------+-------+
| dog| 1| 2| 3| 1|
| cat| 4| 5| 6| 5|
| dog| 7| 8| 9| 7|
| cat| 10| 11| 12| 11|
| dog| 13| 14| 15| 13|
| parrot| 16| 17| 18| 18|
|goldfish| 19| 20| 21| 0|
+--------+------+------+---------+-------+
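For what it's worth, coalesce accepts any number of columns, so the zero fallback can live in the same call; a minimal sketch equivalent to the answer above:
from pyspark.sql import functions as F
# Take the first matching <pet>_30 value and fall back to 0 for pets
# (such as goldfish) that have no matching column.
mycols = [F.when(F.col("pet") == p, F.col(p + "_30")) for p in dfvalues]
df = df.withColumn("newCol2", F.coalesce(*(mycols + [F.lit(0)])))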

Related

Equivalent ungroup() from R in Pyspark

I am trying to group by the values of metric, which can be I or M, and compute a sum based on the values of x. The result should be stored in each row for its respective value. Normally I do this in R with group_by and then ungroup, but I don't know the equivalent in PySpark. Any advice?
from pyspark.sql.functions import *
from pyspark.sql.functions import last
from pyspark.sql.functions import arrays_zip
from pyspark.sql.types import *
data = [["1", "Amit", "DU", "I", "8", "6"],
["2", "Mohit", "DU", "I", "4", "2"],
["3", "rohith", "BHU", "I", "5", "3"],
["4", "sridevi", "LPU", "I", "1", "6"],
["1", "sravan", "KLMP", "M", "2", "4"],
["5", "gnanesh", "IIT", "M", "6", "8"],
["6", "gnadesh", "KLM", "M","0", "9"]]
columns = ['ID', 'NAME', 'college', 'metric', 'x', 'y']
dataframe = spark.createDataFrame(data, columns)
dataframe = dataframe.withColumn("x",dataframe.x.cast(DoubleType()))
This is how the data looks like
+---+-------+-------+------+----+---+
| ID| NAME|college|metric| x| y|
+---+-------+-------+------+----+---+
| 1| Amit| DU| I| 8| 6|
| 2| Mohit| DU| I| 4| 2|
| 3| rohith| BHU| I| 5| 3|
| 4|sridevi| LPU| I| 1| 6|
| 1| sravan| KLMP| M| 2| 4|
| 5|gnanesh| IIT| M| 6| 8|
| 6|gnadesh| KLM| M| 0| 9|
+---+-------+-------+------+----+---+
Expected output
+---+-------+-------+------+----+---+------+
| ID| NAME|college|metric| x| y| total|
+---+-------+-------+------+----+---+------+
| 1| Amit| DU| I| 8| 6| 18 |
| 2| Mohit| DU| I| 4| 2| 18 |
| 3| rohith| BHU| I| 5| 3| 18 |
| 4|sridevi| LPU| I| 1| 6| 18 |
| 1| sravan| KLMP| M| 2| 4| 8 |
| 5|gnanesh| IIT| M| 6| 8| 8 |
| 6|gnadesh| KLM| M| 0| 9| 8 |
+---+-------+-------+------+----+---+------+
I tried this, but it does not work (withColumn expects a Column, not a DataFrame):
dataframe.withColumn("total", dataframe.groupBy("metric").sum("x"))
You can group by metric, calculate the total, and then join the aggregated DataFrame back onto the original data:
from pyspark.sql import functions as F

metric_sum_df = dataframe.groupby('metric').agg(F.sum('x').alias('total'))
total_df = dataframe.join(metric_sum_df, 'metric')
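An alternative that avoids the join is a window aggregate partitioned by metric; a minimal sketch, assuming the same dataframe as above:
from pyspark.sql import Window, functions as F
# Sum of x within each metric group, repeated on every row of that group.
total_df = dataframe.withColumn("total", F.sum("x").over(Window.partitionBy("metric")))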

Group rows with common id in pyspark

I am trying to analyze the number of combinations with the same label but in different periods.
from pyspark.sql.functions import *
from pyspark.sql.functions import last
from pyspark.sql.functions import arrays_zip
from pyspark.sql.types import *
from pyspark.sql import *
data = [["1", "2022-06-01 00:00:04.437000+02:00", "c", "A", "8", "6"],
["2", "2022-06-01 00:00:04.625000+02:00", "e", "A", "4", "2"],
["3", "2022-06-01 00:00:04.640000+02:00", "b", "A", "5", "3"],
["4", "2022-06-01 00:00:04.640000+02:00", "a", "A", "1", "6"],
["1", "2022-06-01 00:00:04.669000+02:00", "c", "B", "2", "4"],
["5", "2022-06-01 00:00:05.223000+02:00", "b", "B", "6", "8"],
["6", "2022-06-01 00:00:05.599886+02:00", "c", "A", None, "9"],
["1", "2022-06-01 00:00:05.740886+02:00", "b", "A", "8", "6"],
["2", "2022-06-01 00:00:05.937000+02:00", "a", "A", "4", "2"],
["3", "2022-06-01 00:00:05.937000+02:00", "e", "A", "5", "3"],
["4", "2022-06-01 00:00:30.746501-05:00", "b", "C", "1", "6"],
["1", "2022-06-01 00:00:30.747498-05:00", "d", "C", "2", "4"],
["5", "2022-06-01 00:00:30.789820+02:00", "b", "D", "6", "8"],
["6", "2022-06-01 00:00:31.062000+02:00", "e", "E", None, "9"],
["1", "2022-06-01 00:00:31.078000+02:00", "b", "E", "8", "6"],
["2", "2022-06-01 00:00:31.078000+02:00", "a", "F", "4", "2"],
["3", "2022-06-01 00:00:31.861017+02:00", "c", "G", "5", "3"],
["4", "2022-06-01 00:00:32.205639+00:00", "b", "H", "1", "6"],
["1", "2022-06-01 00:00:34.656000+02:00", "b", "I", "2", "4"],
["5", "2022-06-01 00:00:34.656000+02:00", "a", "I", "6", "8"],
["6", "2022-06-01 00:00:34.656000+02:00", "e", "I", None, "9"]]
columns = ['ID', 'source_timestamp', 'node_id', 'cd_equipment_no', 'x', 'y']
dataframe = spark.createDataFrame(data, columns)
dataframe = dataframe.withColumn("source_timestamp", to_timestamp(col("source_timestamp")))
This is how the data looks like
+---+--------------------+-------+---------------+----+---+
| ID| source_timestamp|node_id|cd_equipment_no| x| y|
+---+--------------------+-------+---------------+----+---+
| 1|2022-05-31 22:00:...| c| A| 8| 6|
| 2|2022-05-31 22:00:...| e| A| 4| 2|
| 3|2022-05-31 22:00:...| b| A| 5| 3|
| 4|2022-05-31 22:00:...| a| A| 1| 6|
| 1|2022-05-31 22:00:...| c| B| 2| 4|
| 5|2022-05-31 22:00:...| b| B| 6| 8|
| 6|2022-05-31 22:00:...| c| A|null| 9|
| 1|2022-05-31 22:00:...| b| A| 8| 6|
| 2|2022-05-31 22:00:...| a| A| 4| 2|
| 3|2022-05-31 22:00:...| e| A| 5| 3|
| 4|2022-06-01 05:00:...| b| C| 1| 6|
| 1|2022-06-01 05:00:...| d| C| 2| 4|
| 5|2022-05-31 22:00:...| b| D| 6| 8|
| 6|2022-05-31 22:00:...| e| E|null| 9|
| 1|2022-05-31 22:00:...| b| E| 8| 6|
| 2|2022-05-31 22:00:...| a| F| 4| 2|
| 3|2022-05-31 22:00:...| c| G| 5| 3|
| 4|2022-06-01 00:00:...| b| H| 1| 6|
| 1|2022-05-31 22:00:...| b| I| 2| 4|
| 5|2022-05-31 22:00:...| a| I| 6| 8|
+---+--------------------+-------+---------------+----+---+
My intention is to number the rows in ascending source_timestamp order so that the counter restarts whenever cd_equipment_no changes, i.e. each consecutive run of the same cd_equipment_no gets its own 1, 2, 3, ...
This is what I get
window = Window.partitionBy('cd_equipment_no').orderBy(col('source_timestamp'))
dataframe = dataframe.select('*', row_number().over(window).alias('posicion'))
+---+--------------------+-------+---------------+----+---+--------+--------+
| ID| source_timestamp|node_id|cd_equipment_no| x| y|posicion|posicion|
+---+--------------------+-------+---------------+----+---+--------+--------+
| 1|2022-05-31 22:00:...| c| A| 8| 6| 1| 1|
| 2|2022-05-31 22:00:...| e| A| 4| 2| 2| 2|
| 3|2022-05-31 22:00:...| b| A| 5| 3| 3| 3|
| 4|2022-05-31 22:00:...| a| A| 1| 6| 4| 4|
| 6|2022-05-31 22:00:...| c| A|null| 9| 7| 5|
| 1|2022-05-31 22:00:...| b| A| 8| 6| 8| 6|
| 2|2022-05-31 22:00:...| a| A| 4| 2| 9| 7|
| 3|2022-05-31 22:00:...| e| A| 5| 3| 10| 8|
| 1|2022-05-31 22:00:...| c| B| 2| 4| 5| 1|
| 5|2022-05-31 22:00:...| b| B| 6| 8| 6| 2|
| 4|2022-06-01 05:00:...| b| C| 1| 6| 20| 1|
| 1|2022-06-01 05:00:...| d| C| 2| 4| 21| 2|
| 5|2022-05-31 22:00:...| b| D| 6| 8| 11| 1|
| 6|2022-05-31 22:00:...| e| E|null| 9| 12| 1|
| 1|2022-05-31 22:00:...| b| E| 8| 6| 13| 2|
| 2|2022-05-31 22:00:...| a| F| 4| 2| 14| 1|
| 3|2022-05-31 22:00:...| c| G| 5| 3| 15| 1|
| 4|2022-06-01 00:00:...| b| H| 1| 6| 19| 1|
| 1|2022-05-31 22:00:...| b| I| 2| 4| 16| 1|
| 5|2022-05-31 22:00:...| a| I| 6| 8| 17| 2|
+---+--------------------+-------+---------------+----+---+--------+--------+
And this is what I want
+---+--------------------+-------+---------------+----+---+--------+--------+
| ID| source_timestamp|node_id|cd_equipment_no| x| y|posicion|posicion|
+---+--------------------+-------+---------------+----+---+--------+--------+
| 1|2022-05-31 22:00:...| c| A| 8| 6| 1| 1|
| 2|2022-05-31 22:00:...| e| A| 4| 2| 2| 2|
| 3|2022-05-31 22:00:...| b| A| 5| 3| 3| 3|
| 4|2022-05-31 22:00:...| a| A| 1| 6| 4| 4|
| 6|2022-05-31 22:00:...| c| A|null| 9| 7| 1|
| 1|2022-05-31 22:00:...| b| A| 8| 6| 8| 2|
| 2|2022-05-31 22:00:...| a| A| 4| 2| 9| 3|
| 3|2022-05-31 22:00:...| e| A| 5| 3| 10| 4|
| 1|2022-05-31 22:00:...| c| B| 2| 4| 5| 1|
| 5|2022-05-31 22:00:...| b| B| 6| 8| 6| 2|
| 4|2022-06-01 05:00:...| b| C| 1| 6| 20| 1|
| 1|2022-06-01 05:00:...| d| C| 2| 4| 21| 2|
| 5|2022-05-31 22:00:...| b| D| 6| 8| 11| 1|
| 6|2022-05-31 22:00:...| e| E|null| 9| 12| 1|
| 1|2022-05-31 22:00:...| b| E| 8| 6| 13| 2|
| 2|2022-05-31 22:00:...| a| F| 4| 2| 14| 1|
| 3|2022-05-31 22:00:...| c| G| 5| 3| 15| 1|
| 4|2022-06-01 00:00:...| b| H| 1| 6| 19| 1|
| 1|2022-05-31 22:00:...| b| I| 2| 4| 16| 1|
| 5|2022-05-31 22:00:...| a| I| 6| 8| 17| 2|
+---+--------------------+-------+---------------+----+---+--------+--------+
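A common way to get that restart behaviour is the "gaps and islands" trick: flag every change of cd_equipment_no in global time order, turn the flags into a running block id, and number the rows within each block. A sketch under those assumptions (the is_new_block/block_id helper names and the tie-break on ID are my own choices):
from pyspark.sql import Window, functions as F
# Global time order; no partitionBy, so Spark will warn that all rows move to a
# single partition -- acceptable for data of this size. ID only breaks timestamp ties.
time_order = Window.orderBy("source_timestamp", "ID")
# 1 whenever cd_equipment_no differs from the previous row (or there is no previous row).
dataframe = dataframe.withColumn(
    "is_new_block",
    F.when(F.lag("cd_equipment_no").over(time_order) == F.col("cd_equipment_no"), 0).otherwise(1))
# Running sum of the flags: one block id per consecutive run of the same equipment.
dataframe = dataframe.withColumn(
    "block_id",
    F.sum("is_new_block").over(time_order.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
# Number the rows inside each consecutive block, then drop the helpers.
block_order = Window.partitionBy("block_id").orderBy("source_timestamp", "ID")
dataframe = (dataframe
    .withColumn("posicion", F.row_number().over(block_order))
    .drop("is_new_block", "block_id"))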

How to efficiently perform this column operation on a Spark Dataframe?

I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column with value as (number of unique rows of "F1" and "F2" for each unique "F3" / total number of unique rows of "F1" and "F2").
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: in case of F3 = 4, there are only 2 unique F1 and F2 = {(t, y3), (x, a)}. Therefore, for all occurrences of F3 = 4, F4 will be 2/(total number of unique ordered pairs of F1 and F2. Here there are 6 such pairs)
How to achieve the above transformation in Spark Scala?
I just learned, while trying to solve your problem, that you can't use distinct aggregate functions (such as countDistinct) inside a window over DataFrames.
So what I did is create a temporary DataFrame and join it with the initial one to obtain your desired result:
import org.apache.spark.sql.functions._
import spark.implicits._

case class Dog(F1: String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF
val unique_F1_F2 = df.select("F1", "F2").distinct.count
val dd = df.withColumn("X1", concat(col("F1"), col("F2")))
.groupBy("F3")
.agg(countDistinct(col("X1")).as("distinct_count"))
val final_df = dd.join(df, "F3")
.withColumn("F4", col("distinct_count")/unique_F1_F2)
.drop("distinct_count")
final_df.show
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected!
EDIT: I changed df.count to unique_F1_F2.
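For reference, a non-distinct aggregate such as collect_set is allowed over a window, so the join can be avoided as well; a sketch of that alternative, written in PySpark syntax (the main question's tag; the same functions exist in the Scala API) and assuming an equivalent DataFrame df:
from pyspark.sql import Window, functions as F
total_pairs = df.select("F1", "F2").distinct().count()
# size(collect_set(...)) over a window counts distinct (F1, F2) pairs per F3
# without using countDistinct, which is not supported as a window function.
result = (df
    .withColumn("pairs_per_F3", F.size(F.collect_set(F.struct("F1", "F2")).over(Window.partitionBy("F3"))))
    .withColumn("F4", F.col("pairs_per_F3") / F.lit(total_pairs))
    .drop("pairs_per_F3"))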

New column receives the value Null

I have the following DataFrame df:
+-----------+-----------+-----------+
|CommunityId|nodes_count|edges_count|
+-----------+-----------+-----------+
| 26| 3| 11|
| 964| 16| 18|
| 1806| 9| 31|
| 2040| 13| 12|
| 2214| 8| 8|
| 2927| 7| 7|
Then I add the column Rate as follows:
df
.withColumn("Rate",when(col("nodes_count") =!= 0, (lit("edges_count")/lit("nodes_count")).as[Double]).otherwise(0.0))
This is what I get:
+-----------+-----------+-----------+-----------------------+
|CommunityId|nodes_count|edges_count| Rate|
+-----------+-----------+-----------+-----------------------+
| 26| 3| 11| null|
| 964| 16| 18| null|
| 1806| 9| 31| null|
| 2040| 13| 12| null|
| 2214| 8| 8| null|
| 2927| 7| 7| null|
For some reason Rate is always equal to null.
That happens because you use lit, which creates a literal column holding the string itself, so the expression divides the string "edges_count" by the string "nodes_count" and yields null. You should use col instead:
df
.withColumn(
"Rate" ,when(col("nodes_count") =!= 0,
(col("edges_count") / col("nodes_count")).as[Double]).otherwise(0.0))
although both when and .as[Double] are unnecessary here (with default settings, division by zero yields null rather than an error), and a simple division would be more than sufficient:
df.withColumn("Rate", col("edges_count") / col("nodes_count"))

spark sql conditional maximum

I have a tall table which contains up to 10 values per group. How can I transform this table into a wide format, i.e. add two columns that hold the value smaller than or equal to a threshold?
I want to find the maximum per group, but it needs to be smaller than a specified value like:
min(max('value1), lit(5)).over(Window.partitionBy('grouping))
However, min() only works on a column, not on the Scala value returned by the inner function.
The problem can be described as:
Seq(Seq(1,2,3,4).max,5).min
Where Seq(1,2,3,4) is returned by the window.
How can I formulate this in Spark SQL?
edit
E.g.
+--------+-----+---------+
|grouping|value|something|
+--------+-----+---------+
| 1| 1| first|
| 1| 2| second|
| 1| 3| third|
| 1| 4| fourth|
| 1| 7| 7|
| 1| 10| 10|
| 21| 1| first|
| 21| 2| second|
| 21| 3| third|
+--------+-----+---------+
created by
case class MyThing(grouping: Int, value:Int, something:String)
val df = Seq(MyThing(1,1, "first"), MyThing(1,2, "second"), MyThing(1,3, "third"),MyThing(1,4, "fourth"),MyThing(1,7, "7"), MyThing(1,10, "10"),
MyThing(21,1, "first"), MyThing(21,2, "second"), MyThing(21,3, "third")).toDS
Where
df
.withColumn("somethingAtLeast5AndMaximum5", max('value).over(Window.partitionBy('grouping)))
.withColumn("somethingAtLeast6OupToThereshold2", max('value).over(Window.partitionBy('grouping)))
.show
returns
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5| somethingAtLeast6OupToThereshold2 |
+--------+-----+---------+----------------------------+-------------------------+
| 1| 1| first| 10| 10|
| 1| 2| second| 10| 10|
| 1| 3| third| 10| 10|
| 1| 4| fourth| 10| 10|
| 1| 7| 7| 10| 10|
| 1| 10| 10| 10| 10|
| 21| 1| first| 3| 3|
| 21| 2| second| 3| 3|
| 21| 3| third| 3| 3|
+--------+-----+---------+----------------------------+-------------------------+
Instead, I would rather formulate:
lit(Seq(max('value).asInstanceOf[java.lang.Integer], new java.lang.Integer(2)).min).over(Window.partitionBy('grouping))
But that does not work as max('value) is not a scalar value.
Expected output should look like
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5|somethingAtLeast6OupToThereshold2|
+--------+-----+---------+----------------------------+-------------------------+
| 1| 4| fourth| 4| 7|
| 21| 1| first| 3| NULL|
+--------+-----+---------+----------------------------+-------------------------+
edit2
When trying a pivot
df.groupBy("grouping").pivot("value").agg(first('something)).show
+--------+-----+------+-----+------+----+----+
|grouping| 1| 2| 3| 4| 7| 10|
+--------+-----+------+-----+------+----+----+
| 1|first|second|third|fourth| 7| 10|
| 21|first|second|third| null|null|null|
+--------+-----+------+-----+------+----+----+
The second part of the problem remains that some columns might not exist or be null.
When aggregating to arrays:
df.groupBy("grouping").agg(collect_list('value).alias("value"), collect_list('something).alias("something"))
+--------+-------------------+--------------------+
|grouping| value| something|
+--------+-------------------+--------------------+
| 1|[1, 2, 3, 4, 7, 10]|[first, second, t...|
| 21| [1, 2, 3]|[first, second, t...|
+--------+-------------------+--------------------+
The values are already next to each other, but the right values need to be selected. This is probably still more efficient than a join or window function.
It would be easier to do this in two separate steps: calculate the max over a window, and then use when...otherwise on the result to produce min(x, 5):
df.withColumn("tmp", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('tmp > lit(5), 5).otherwise('tmp))
EDIT: some example data to clarify this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 1), (1, 2), (1, 3), (1, 4), (2, 7), (2, 8))
.toDF("grouping", "value1")
df.withColumn("result", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('result > lit(5), 5).otherwise('result))
.show()
// +--------+------+------+
// |grouping|value1|result|
// +--------+------+------+
// | 1| 1| 4| // 4, because Seq(Seq(1,2,3,4).max,5).min = 4
// | 1| 2| 4|
// | 1| 3| 4|
// | 1| 4| 4|
// | 2| 7| 5| // 5, because Seq(Seq(7,8).max,5).min = 5
// | 2| 8| 5|
// +--------+------+------+
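As a side note, the two steps can also be collapsed with the built-in least function, which takes the row-wise minimum of its column arguments; a sketch in PySpark syntax (least exists under the same name in the Scala functions API), assuming the same grouping/value1 columns:
from pyspark.sql import Window, functions as F
w = Window.partitionBy("grouping")
# min(max(value1) per group, 5) in a single expression.
df = df.withColumn("result", F.least(F.max("value1").over(w), F.lit(5)))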