I am trying to group by the values of metric, which can be I or M, and compute a sum of the values of x per group. The result should be stored in every row of the corresponding group. In R I would normally do this with group and then ungroup, but I don't know the equivalent in PySpark. Any advice?
from pyspark.sql.functions import *
from pyspark.sql.functions import last
from pyspark.sql.functions import arrays_zip
from pyspark.sql.types import *
data = [["1", "Amit", "DU", "I", "8", "6"],
["2", "Mohit", "DU", "I", "4", "2"],
["3", "rohith", "BHU", "I", "5", "3"],
["4", "sridevi", "LPU", "I", "1", "6"],
["1", "sravan", "KLMP", "M", "2", "4"],
["5", "gnanesh", "IIT", "M", "6", "8"],
["6", "gnadesh", "KLM", "M","0", "9"]]
columns = ['ID', 'NAME', 'college', 'metric', 'x', 'y']
dataframe = spark.createDataFrame(data, columns)
dataframe = dataframe.withColumn("x",dataframe.x.cast(DoubleType()))
This is what the data looks like:
+---+-------+-------+------+----+---+
| ID| NAME|college|metric| x| y|
+---+-------+-------+------+----+---+
| 1| Amit| DU| I| 8| 6|
| 2| Mohit| DU| I| 4| 2|
| 3| rohith| BHU| I| 5| 3|
| 4|sridevi| LPU| I| 1| 6|
| 1| sravan| KLMP| M| 2| 4|
| 5|gnanesh| IIT| M| 6| 8|
| 6|gnadesh| KLM| M| 0| 9|
+---+-------+-------+------+----+---+
Expected output
+---+-------+-------+------+----+---+------+
| ID| NAME|college|metric| x| y| total|
+---+-------+-------+------+----+---+------+
| 1| Amit| DU| I| 8| 6|    18|
| 2| Mohit| DU| I| 4| 2|    18|
| 3| rohith| BHU| I| 5| 3|    18|
| 4|sridevi| LPU| I| 1| 6|    18|
| 1| sravan| KLMP| M| 2| 4|     8|
| 5|gnanesh| IIT| M| 6| 8|     8|
| 6|gnadesh| KLM| M| 0| 9|     8|
+---+-------+-------+------+----+---+------+
I tried this, but it does not work:
dataframe.withColumn("total",dataframe.groupBy("metric").sum("x"))
You can group the data by metric to calculate the total, and then join the aggregated DataFrame back to the original data:
import pyspark.sql.functions as F

metric_sum_df = dataframe.groupby('metric').agg(F.sum('x').alias('total'))
total_df = dataframe.join(metric_sum_df, 'metric')
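If you want to avoid the join, the same per-group total can usually be computed with a window aggregate instead (a minimal sketch, reusing the dataframe defined above):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Sum x over all rows that share the same metric, keeping every original row
w = Window.partitionBy('metric')
total_df = dataframe.withColumn('total', F.sum('x').over(w))
total_df.show()

Both approaches repeat the per-metric total on every row; the window version simply skips the explicit aggregate-and-join step.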
Related
I am trying to analyze the number of combinations with the same label but in different periods.
from pyspark.sql.functions import *
from pyspark.sql.functions import last
from pyspark.sql.functions import arrays_zip
from pyspark.sql.types import *
from pyspark.sql import *
data = [["1", "2022-06-01 00:00:04.437000+02:00", "c", "A", "8", "6"],
["2", "2022-06-01 00:00:04.625000+02:00", "e", "A", "4", "2"],
["3", "2022-06-01 00:00:04.640000+02:00", "b", "A", "5", "3"],
["4", "2022-06-01 00:00:04.640000+02:00", "a", "A", "1", "6"],
["1", "2022-06-01 00:00:04.669000+02:00", "c", "B", "2", "4"],
["5", "2022-06-01 00:00:05.223000+02:00", "b", "B", "6", "8"],
["6", "2022-06-01 00:00:05.599886+02:00", "c", "A", None, "9"],
["1", "2022-06-01 00:00:05.740886+02:00", "b", "A", "8", "6"],
["2", "2022-06-01 00:00:05.937000+02:00", "a", "A", "4", "2"],
["3", "2022-06-01 00:00:05.937000+02:00", "e", "A", "5", "3"],
["4", "2022-06-01 00:00:30.746501-05:00", "b", "C", "1", "6"],
["1", "2022-06-01 00:00:30.747498-05:00", "d", "C", "2", "4"],
["5", "2022-06-01 00:00:30.789820+02:00", "b", "D", "6", "8"],
["6", "2022-06-01 00:00:31.062000+02:00", "e", "E", None, "9"],
["1", "2022-06-01 00:00:31.078000+02:00", "b", "E", "8", "6"],
["2", "2022-06-01 00:00:31.078000+02:00", "a", "F", "4", "2"],
["3", "2022-06-01 00:00:31.861017+02:00", "c", "G", "5", "3"],
["4", "2022-06-01 00:00:32.205639+00:00", "b", "H", "1", "6"],
["1", "2022-06-01 00:00:34.656000+02:00", "b", "I", "2", "4"],
["5", "2022-06-01 00:00:34.656000+02:00", "a", "I", "6", "8"],
["6", "2022-06-01 00:00:34.656000+02:00", "e", "I", None, "9"]]
columns = ['ID', 'source_timestamp', 'node_id', 'cd_equipment_no', 'x', 'y']
dataframe = spark.createDataFrame(data, columns)
dataframe = dataframe.withColumn("source_timestamp", to_timestamp(col("source_timestamp")))
This is what the data looks like:
+---+--------------------+-------+---------------+----+---+
| ID| source_timestamp|node_id|cd_equipment_no| x| y|
+---+--------------------+-------+---------------+----+---+
| 1|2022-05-31 22:00:...| c| A| 8| 6|
| 2|2022-05-31 22:00:...| e| A| 4| 2|
| 3|2022-05-31 22:00:...| b| A| 5| 3|
| 4|2022-05-31 22:00:...| a| A| 1| 6|
| 1|2022-05-31 22:00:...| c| B| 2| 4|
| 5|2022-05-31 22:00:...| b| B| 6| 8|
| 6|2022-05-31 22:00:...| c| A|null| 9|
| 1|2022-05-31 22:00:...| b| A| 8| 6|
| 2|2022-05-31 22:00:...| a| A| 4| 2|
| 3|2022-05-31 22:00:...| e| A| 5| 3|
| 4|2022-06-01 05:00:...| b| C| 1| 6|
| 1|2022-06-01 05:00:...| d| C| 2| 4|
| 5|2022-05-31 22:00:...| b| D| 6| 8|
| 6|2022-05-31 22:00:...| e| E|null| 9|
| 1|2022-05-31 22:00:...| b| E| 8| 6|
| 2|2022-05-31 22:00:...| a| F| 4| 2|
| 3|2022-05-31 22:00:...| c| G| 5| 3|
| 4|2022-06-01 00:00:...| b| H| 1| 6|
| 1|2022-05-31 22:00:...| b| I| 2| 4|
| 5|2022-05-31 22:00:...| a| I| 6| 8|
+---+--------------------+-------+---------------+----+---+
My intention is to create an identifier based on source_timestamp sorted in ascending order.
This is what I get
window = Window.partitionBy('cd_equipment_no').orderBy(col('source_timestamp'))
dataframe = dataframe.select('*', row_number().over(window).alias('posicion'))
+---+--------------------+-------+---------------+----+---+--------+--------+
| ID| source_timestamp|node_id|cd_equipment_no| x| y|posicion|posicion|
+---+--------------------+-------+---------------+----+---+--------+--------+
| 1|2022-05-31 22:00:...| c| A| 8| 6| 1| 1|
| 2|2022-05-31 22:00:...| e| A| 4| 2| 2| 2|
| 3|2022-05-31 22:00:...| b| A| 5| 3| 3| 3|
| 4|2022-05-31 22:00:...| a| A| 1| 6| 4| 4|
| 6|2022-05-31 22:00:...| c| A|null| 9| 7| 5|
| 1|2022-05-31 22:00:...| b| A| 8| 6| 8| 6|
| 2|2022-05-31 22:00:...| a| A| 4| 2| 9| 7|
| 3|2022-05-31 22:00:...| e| A| 5| 3| 10| 8|
| 1|2022-05-31 22:00:...| c| B| 2| 4| 5| 1|
| 5|2022-05-31 22:00:...| b| B| 6| 8| 6| 2|
| 4|2022-06-01 05:00:...| b| C| 1| 6| 20| 1|
| 1|2022-06-01 05:00:...| d| C| 2| 4| 21| 2|
| 5|2022-05-31 22:00:...| b| D| 6| 8| 11| 1|
| 6|2022-05-31 22:00:...| e| E|null| 9| 12| 1|
| 1|2022-05-31 22:00:...| b| E| 8| 6| 13| 2|
| 2|2022-05-31 22:00:...| a| F| 4| 2| 14| 1|
| 3|2022-05-31 22:00:...| c| G| 5| 3| 15| 1|
| 4|2022-06-01 00:00:...| b| H| 1| 6| 19| 1|
| 1|2022-05-31 22:00:...| b| I| 2| 4| 16| 1|
| 5|2022-05-31 22:00:...| a| I| 6| 8| 17| 2|
+---+--------------------+-------+---------------+----+---+--------+--------+
And this is what I want
+---+--------------------+-------+---------------+----+---+--------+--------+
| ID| source_timestamp|node_id|cd_equipment_no| x| y|posicion|posicion|
+---+--------------------+-------+---------------+----+---+--------+--------+
| 1|2022-05-31 22:00:...| c| A| 8| 6| 1| 1|
| 2|2022-05-31 22:00:...| e| A| 4| 2| 2| 2|
| 3|2022-05-31 22:00:...| b| A| 5| 3| 3| 3|
| 4|2022-05-31 22:00:...| a| A| 1| 6| 4| 4|
| 6|2022-05-31 22:00:...| c| A|null| 9| 7| 1|
| 1|2022-05-31 22:00:...| b| A| 8| 6| 8| 2|
| 2|2022-05-31 22:00:...| a| A| 4| 2| 9| 3|
| 3|2022-05-31 22:00:...| e| A| 5| 3| 10| 4|
| 1|2022-05-31 22:00:...| c| B| 2| 4| 5| 1|
| 5|2022-05-31 22:00:...| b| B| 6| 8| 6| 2|
| 4|2022-06-01 05:00:...| b| C| 1| 6| 20| 1|
| 1|2022-06-01 05:00:...| d| C| 2| 4| 21| 2|
| 5|2022-05-31 22:00:...| b| D| 6| 8| 11| 1|
| 6|2022-05-31 22:00:...| e| E|null| 9| 12| 1|
| 1|2022-05-31 22:00:...| b| E| 8| 6| 13| 2|
| 2|2022-05-31 22:00:...| a| F| 4| 2| 14| 1|
| 3|2022-05-31 22:00:...| c| G| 5| 3| 15| 1|
| 4|2022-06-01 00:00:...| b| H| 1| 6| 19| 1|
| 1|2022-05-31 22:00:...| b| I| 2| 4| 16| 1|
| 5|2022-05-31 22:00:...| a| I| 6| 8| 17| 2|
+---+--------------------+-------+---------------+----+---+--------+--------+
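Reading the expected output, the counter should restart whenever a group's rows stop being consecutive in the overall timestamp order. One way this kind of reset is often handled (a sketch only, under that reading of the expected output, and not necessarily how the original poster solved it) is the gaps-and-islands trick: compare a global row number with the per-group row number, and number the rows again inside each resulting island:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Global position of every event, and position within its equipment group
w_global = Window.orderBy('source_timestamp')   # pulls all rows into one partition; acceptable for a sketch
w_group = Window.partitionBy('cd_equipment_no').orderBy('source_timestamp')

tmp = (dataframe
       .withColumn('pos_global', F.row_number().over(w_global))
       .withColumn('pos_group', F.row_number().over(w_group))
       # Consecutive rows of the same group share the same difference, i.e. the same island
       .withColumn('island', F.col('pos_global') - F.col('pos_group')))

w_island = Window.partitionBy('cd_equipment_no', 'island').orderBy('source_timestamp')
result = tmp.withColumn('posicion', F.row_number().over(w_island))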
I am facing a problem with a PySpark DataFrame loaded from a CSV file, where my numeric columns have empty values, like below:
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| |
| Abid Ali, S| 29| 5| |
|Adhikari, H R| 21| | |
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
I cast those columns to integer, and all the empty values became null:
df_data_csv_casted = df_data_csv.select(
    df_data_csv['Country'],
    df_data_csv['Player_Name'],
    df_data_csv['Test_Matches'].cast(IntegerType()).alias("Test_Matches"),
    df_data_csv['ODI_Matches'].cast(IntegerType()).alias("ODI_Matches"),
    df_data_csv['T20_Matches'].cast(IntegerType()).alias("T20_Matches"))
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| null|
| Abid Ali, S| 29| 5| null|
|Adhikari, H R| 21| null| null|
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
Then I compute a total, but if one of the values is null, the result also comes out as null. How can I solve this?
df_data_csv_withTotalCol = df_data_csv_casted.withColumn(
    'Total_Matches',
    df_data_csv_casted['Test_Matches'] + df_data_csv_casted['ODI_Matches'] + df_data_csv_casted['T20_Matches'])
+-------------+------------+-----------+-----------+-------------+
|Player_Name |Test_Matches|ODI_Matches|T20_Matches|Total_Matches|
+-------------+------------+-----------+-----------+-------------+
| Aaron, V R | 9| 9| null| null|
|Abid Ali, S | 29| 5| null| null|
|Adhikari, H R| 21| null| null| null|
|Agarkar, A B | 26| 191| 4| 221|
+-------------+------------+-----------+-----------+-------------+
You can fix this by using the coalesce function. For example, let's create some sample data:
from pyspark.sql.functions import coalesce,lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
+----+----+
| a| b|
+----+----+
|null|null|
| 1|null|
|null| 2|
+----+----+
When I do a simple sum as you did:
cDf.withColumn('Total',cDf.a+cDf.b).show()
I get the total as null, the same as you described:
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| null|
| 1|null| null|
|null| 2| null|
+----+----+-----+
To fix this, use coalesce together with the lit function, which replaces null values with zeroes:
cDf.withColumn('Total',coalesce(cDf.a,lit(0)) +coalesce(cDf.b,lit(0))).show()
This gives the correct results:
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| 0|
| 1|null| 1|
|null| 2| 2|
+----+----+-----+
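An equivalent approach (a sketch on the same sample data) is to replace the nulls up front with na.fill and then add the columns directly; note that, unlike coalesce, this also turns the nulls shown in a and b themselves into zeroes in the output:

from pyspark.sql.functions import col

cDf.na.fill(0, subset=['a', 'b']).withColumn('Total', col('a') + col('b')).show()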
I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column whose value is (the number of unique ("F1", "F2") rows for each unique "F3") / (the total number of unique ("F1", "F2") rows).
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: for F3 = 4 there are only 2 unique (F1, F2) pairs, {(t, y3), (x, a)}. Therefore, for every occurrence of F3 = 4, F4 is 2 / (the total number of unique (F1, F2) pairs), and there are 6 such pairs here.
How to achieve the above transformation in Spark Scala?
While trying to solve your problem, I just learned that you can't use distinct aggregate functions while performing a Window over DataFrames.
So what I did was create a temporary DataFrame and join it with the initial one to obtain your desired result:
import org.apache.spark.sql.functions._   // concat, col, countDistinct
import spark.implicits._                   // toDF on a Seq (implicit in spark-shell)

case class Dog(F1:String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF
val unique_F1_F2 = df.select("F1", "F2").distinct.count
val dd = df.withColumn("X1", concat(col("F1"), col("F2")))
.groupBy("F3")
.agg(countDistinct(col("X1")).as("distinct_count"))
val final_df = dd.join(df, "F3")
.withColumn("F4", col("distinct_count")/unique_F1_F2)
.drop("distinct_count")
final_df.show
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected!
EDIT: I changed df.count to unique_F1_F2.
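As a side note, while countDistinct is indeed not allowed as a window function, a size(collect_set(...)) window aggregate is a common workaround that avoids the join entirely. Here is a PySpark sketch of the same computation, assuming an equivalent DataFrame df with columns F1, F2, F3 (the question asks for Scala, so this is only for illustration):

from pyspark.sql import Window
from pyspark.sql import functions as F

total_pairs = df.select('F1', 'F2').distinct().count()

# Distinct (F1, F2) pairs per F3, computed as a window aggregate instead of a groupBy + join
w = Window.partitionBy('F3')
result = df.withColumn(
    'F4', F.size(F.collect_set(F.struct('F1', 'F2')).over(w)) / F.lit(total_pairs))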
Spark 2.2.1
PySpark
df = sqlContext.createDataFrame([
("dog", "1", "2", "3"),
("cat", "4", "5", "6"),
("dog", "7", "8", "9"),
("cat", "10", "11", "12"),
("dog", "13", "14", "15"),
("parrot", "16", "17", "18"),
("goldfish", "19", "20", "21"),
], ["pet", "dog_30", "cat_30", "parrot_30"])
And then I have a list of the values that I care about from the "pet" column:
dfvalues = ["dog", "cat", "parrot"]
I want to write code that will give me the value from dog_30, cat_30, or parrot_30 that corresponds to the value in "pet". For example, in the first row the value of the pet column is dog, so we take the value of dog_30, which is 1.
I tried using this to get the code, but it just gives me nulls for the stats column. I also haven't figured out how to handle the goldfish case; I want to set that to 0.
mycols = [F.when(F.col("pet") == p + "_30", p) for p in dfvalues]
df = df.withColumn("newCol2",F.coalesce(*stats) )
df.show()
Desired output:
+--------+------+------+---------+------+
| pet|dog_30|cat_30|parrot_30|stats |
+--------+------+------+---------+------+
| dog| 1| 2| 3| 1 |
| cat| 4| 5| 6| 5 |
| dog| 7| 8| 9| 7 |
| cat| 10| 11| 12| 11 |
| dog| 13| 14| 15| 13 |
| parrot| 16| 17| 18| 18 |
|goldfish| 19| 20| 21| 0 |
+--------+------+------+---------+------+
The logic is off: the condition should compare pet to p and return the matching _30 column, i.e. .when(F.col("pet") == p, F.col(p + '_30')):
mycols = [F.when(F.col("pet") == p, F.col(p + '_30')) for p in dfvalues]
df = df.withColumn("newCol2",F.coalesce(F.coalesce(*mycols),F.lit(0)))
df.show()
+--------+------+------+---------+-------+
| pet|dog_30|cat_30|parrot_30|newCol2|
+--------+------+------+---------+-------+
| dog| 1| 2| 3| 1|
| cat| 4| 5| 6| 5|
| dog| 7| 8| 9| 7|
| cat| 10| 11| 12| 11|
| dog| 13| 14| 15| 13|
| parrot| 16| 17| 18| 18|
|goldfish| 19| 20| 21| 0|
+--------+------+------+---------+-------+
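The same mapping can also be written as one chained when/otherwise expression instead of coalesce (a sketch using the df and dfvalues defined above):

import pyspark.sql.functions as F

# Start from the goldfish fallback and wrap one branch per pet type around it
expr = F.lit(0)
for p in dfvalues:
    expr = F.when(F.col('pet') == p, F.col(p + '_30')).otherwise(expr)

df = df.withColumn('newCol2', expr)
df.show()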
I'm using Spark 1.6.1, and I have a dataframe like this:
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
| scene_id| action_id| classifier|os_name|country|app_ver| p0value|p1value|p2value|p3value|p4value|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
| test_home|scene_enter| test_home|android| KR| 5.6.3|__OTHERS__| false| test| test| test|
......
And I want to get a dataframe like the following by using a cube operation
(grouped by all fields, but with only the "os_name", "country", and "app_ver" fields cubed):
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
| scene_id| action_id| classifier|os_name|country|app_ver| p0value|p1value|p2value|p3value|p4value|cnt|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
| test_home|scene_enter| test_home|android| KR| 5.6.3|__OTHERS__| false| test| test| test| 9|
| test_home|scene_enter| test_home| null| KR| 5.6.3|__OTHERS__| false| test| test| test| 35|
| test_home|scene_enter| test_home|android| null| 5.6.3|__OTHERS__| false| test| test| test| 98|
| test_home|scene_enter| test_home|android| KR| null|__OTHERS__| false| test| test| test|101|
| test_home|scene_enter| test_home| null| null| 5.6.3|__OTHERS__| false| test| test| test|301|
| test_home|scene_enter| test_home| null| KR| null|__OTHERS__| false| test| test| test|225|
| test_home|scene_enter| test_home|android| null| null|__OTHERS__| false| test| test| test|312|
| test_home|scene_enter| test_home| null| null| null|__OTHERS__| false| test| test| test|521|
......
I have tried the following, but it seems slow and ugly:
var cubed = df
.cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver", $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
.count
.where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND p3value IS NOT NULL AND p4value IS NOT NULL")
Any better solutions?
I believe you cannot avoid the problem completely, but there is a simple trick with which you can reduce its scale. The idea is to replace all columns that shouldn't be marginalized with a single placeholder.
For example if you have a DataFrame:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")
and you're interested in a cube marginalized by d and e and grouped by a..c, you can define the substitute for a..c as:
import org.apache.spark.sql.functions.struct
import sparkSql.implicits._  // implicits from your SQLContext, for the $-column syntax
// alias here may not work in Spark 1.6
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")
and cube:
val cubed = Seq($"d", $"e")
// If there is a problem with aliasing rest, it can be done here.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count
Quick filter and select should handle the rest:
tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)
with a result like:
+---+---+---+----+----+-----+
| a| b| c| d| e|count|
+---+---+---+----+----+-----+
| 1| 2| 3|null| 5| 1|
| 1| 2| 3|null|null| 1|
| 1| 2| 3| 4| 5| 1|
| 1| 2| 3| 4|null| 1|
+---+---+---+----+----+-----+
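For reference, the same placeholder trick translates almost one-to-one to PySpark (a sketch, assuming a PySpark DataFrame df with the same columns a..f):

from pyspark.sql import functions as F

# Pack the columns that should not be marginalized into a single struct
rest = F.struct('a', 'b', 'c').alias('rest')
tmp = df.cube(rest, 'd', 'e').count()

# Keep only the rows where the placeholder itself was not marginalized, then unpack it
result = tmp.where(F.col('rest').isNotNull()).select('rest.*', 'd', 'e', 'count')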