How can I add one column to other columns in PySpark? - pyspark

I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've tried naively running a UDF on individual columns but it takes respectively 30s-50s-80s... (keeps increasing) per column so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this transformation of adding one column to many other?

By using rdd, it can be done in this way.
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
.toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+

Related

Use different dataframes to create new one with information (Scala Spark)

I have one dataframe with games and three valoration for every game from different reviews, every valoration is traduced in another dataframe as you can see:
Df_reviews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+- ------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_rev1
+----------+-------------+
| review_1 | Equivalence |
+----------+-------------+
|XX+ | 9 |
|Y | 6 |
|Z- | 3 |
Df_rev2
+----------+-------------+
| review_2 | Equivalence |
+----------+-------------+
|K2 | 7 |
|K1+ | 6 |
|K3 | 10 |
Df_rev3
+----------+-------------+
| review_3 | Equivalence |
+----------+-------------+
|L3 | 10 |
|L2 | 9 |
|L1 | 8 |
I have to traduce it in a new dataframe with the valoration traduced and add a column with the second best valoration, for this example would be:
Df_output
+--------+---------+---------+----------+-------------+
|Game | rev_1_t | rev_2_t | rev_3_t | second_best |
+--------+---------+---------+----------+-------------+
|CA | 9 | 7 | 8 | 8 |
|FT | 3 | 6 | 10 | 6 |
To traduce it, I am trying with a left join but I am so lost. How can I deal with this?
####### Second Part ######
How can I translate columns from one dataframe to others from another dataframe, joining with multiple columns vs one? for example:
Df_revuews
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+- ------+-------+-------+--------+
|CA |XX+ | K2 | L1 |
|FT |Z- | K1+ | L3 |
Df_equiv
+--------+-------+
|Valorat | num |
+- ------+-------+
|X |3 |
|XX+ |5 |
|Z |7 |
|Z- |6 |
|K1+ |6 |
|K2 |4 |
|L1 |5 |
|L2 |6 |
|L3 |7 |
Output
+--------+-------+-------+--------+
|Game | rev_1 | rev_2 | rev_3 |
+- ------+-------+-------+--------+
|CA |5 | 4 | 5 |
|FT |6 | 6 | 7 |
I am doing this as you can see:
val joined = df_reviews
.join(df_equiv, df_reviews("rev_1") === df_equiv("num") && df_reviews("rev_2") === df_equiv("num")
&& df_reviews("rev_3") === df_equiv("num"), "left")
.select(df_reviews("Game"),
df_equiv("num").as("rev_1_t"),
df_equiv("num").as("rev_2_t"),
df_equiv("num").as("rev_3_t")
)
Thanks in advance!
You can do some left joins and get the second highest column using sort_array:
val joined = df_reviews
.join(df_rev1, df_reviews("rev_1") === df_rev1("review_1"), "left")
.join(df_rev2, df_reviews("rev_2") === df_rev2("review_2"), "left")
.join(df_rev3, df_reviews("rev_3") === df_rev3("review_3"), "left")
.select(df_reviews("Game"),
df_rev1("Equivalence").as("rev_1_t"),
df_rev2("Equivalence").as("rev_2_t"),
df_rev3("Equivalence").as("rev_3_t")
)
val array_sort_udf = udf((x: Seq[Int]) => x.sortBy(_ != null))
val result = joined.withColumn(
"second_best",
coalesce(
array_sort_udf(
array(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
)(1),
greatest(col("rev_1_t").cast("int"), col("rev_2_t").cast("int"), col("rev_3_t").cast("int"))
)
)
result.show
+----+-------+-------+-------+-----------+
|Game|rev_1_t|rev_2_t|rev_3_t|second_best|
+----+-------+-------+-------+-----------+
| CA| 9| 7| 8| 8|
| FT| 3| 6| 10| 6|
+----+-------+-------+-------+-----------+
For your second question:
val joined = df_reviews.as("r1")
.join(df_equiv.as("e1"), expr("r1.rev_1 = e1.Valorat"), "left")
.selectExpr("Game", "e1.num as rev_1", "rev_2", "rev_3")
.as("r2")
.join(df_equiv.as("e2"), expr("r2.rev_2 = e2.Valorat"), "left")
.selectExpr("Game", "rev_1", "e2.num as rev_2", "rev_3")
.as("r3")
.join(df_equiv.as("e3"), expr("r3.rev_3 = e3.Valorat"), "left")
.selectExpr("Game", "rev_1", "rev_2", "e3.num as rev_3")
joined.show
+----+-----+-----+-----+
|Game|rev_1|rev_2|rev_3|
+----+-----+-----+-----+
| CA| 5| 4| 5|
| FT| 6| 6| 7|
+----+-----+-----+-----+

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's date in pair and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I will like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using Zeppelin notebook
One approach would be to use lag(Date) over Window partition and a UDF to calculate the difference in days between consecutive rows, then follow by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use lag function to get the previous date over Window function, then do some manipulation to get the final dataframe that you require
first of all the Date_d_id column need to be converted to include timestamp for sorting to work correctly
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
then you can apply the lag function over window function and finally get the date difference as
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff", when(lag("Date_d_id", 1).over(windowSpec).isNull, null).otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Now if you want to neglect the null Day_diff then you can do
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count($"Day_diff".isNotNull)).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful

GroupBy based on conditions in Spark dataframe

I have two dataframe,
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is actual dataframe where I need to apply groupBy operation
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the key,value columns
It has to take the keys from dataframe1, it has to take the respective value and do the groupBy operation on the dataframe2
Dframe= df.groupBy($"key").sum("amount").show()
Expected Output: Generate three dataframes based on number of keys in dataframe
d1= df.grouBy($"key1").sum("amount").show()
it has to be : df.grouBy($"column1").sum("amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
Code:
d2=df.groupBy($"key2").sum("amount").show()
result: df.grouBy($"column2").sum("amount").show()
dataframe:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
Code :
d3.df.groupBy($"key3").sum("amount").show()
DataFrame:
+---+-----+-----+
| A | XYZ | 320 |
+---+-----+-----+
| A | PQR | 10 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
In future, if I add more keys , it has to show the dataframe. Can someone help me.
Given the key value dataframe as ( which I suggest you not to form dataframe from the source data, reason is given below)
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
I would suggest you not to convert the first dataframe to rdd maps as
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
And then loop the maps as
import org.apache.spark.sql.functions._
for(kv <- maps){
df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
//you can store the results in separate dataframes or write them to files or database
}
You should have follwing outputs
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+

Spark SQL window function look ahead and complex function

I have the following data:
+-----+----+-----+
|event|t |type |
+-----+----+-----+
| A |20 | 1 |
| A |40 | 1 |
| B |10 | 1 |
| B |20 | 1 |
| B |120 | 1 |
| B |140 | 1 |
| B |320 | 1 |
| B |340 | 1 |
| B |360 | 7 |
| B |380 | 1 |
+-----+-----+----+
And what I want is something like this:
+-----+----+----+
|event|t |grp |
+-----+----+----+
| A |20 |1 |
| A |40 |1 |
| B |10 |2 |
| B |20 |2 |
| B |120 |3 |
| B |140 |3 |
| B |320 |4 |
| B |340 |4 |
| B |380 |5 |
+-----+----+----+
Rules:
Group all Values together that are at least 50ms away from each other. (column t) and belongs to the same event.
When a row of type 7 appears take a cut too and remove this row. (see last row)
The first rule I can achieve with the answer from this thread:
Code:
val windowSpec= Window.partitionBy("event").orderBy("t")
val newSession = (coalesce(
($"t" - lag($"t", 1).over(windowSpec)),
lit(0)
) > 50).cast("bigint")
val sessionized = df.withColumn("session", sum(newSession).over(userWindow))
I have to say I can't figure it out how it works and don't know how to modify it so that rule 2 also works...
Hope someone can give me some useful hints.
What I tried:
val newSession = (coalesce(
($"t" - lag($"t", 1).over(windowSpec)),
lit(0)
) > 50 || lead($"type",1).over(windowSpec) =!= 7 ).cast("bigint")
But only an error occurred: "Must follow method; cannot follow org.apache.spark.sql.Column val grp = (coalesce(
this should do the trick:
val newSession = (coalesce(
($"t" - lag($"t", 1).over(win)),
lit(0)
) > 50
or $"type"===7) // also start new group in this case
.cast("bigint")
df.withColumn("session", sum(newSession).over(win))
.where($"type"=!=7) // remove these rows
.orderBy($"event",$"t")
.show
gives:
+-----+---+----+-------+
|event| t|type|session|
+-----+---+----+-------+
| A| 20| 1| 0|
| A| 40| 1| 0|
| B| 10| 1| 0|
| B| 20| 1| 0|
| B|120| 1| 1|
| B|140| 1| 1|
| B|320| 1| 2|
| B|340| 1| 2|
| B|380| 1| 3|
+-----+---+----+-------+

Postgres select from table and spread evenly

I have a 2 tables. First table contains information of the object, second table contains related objects. Second table objects have 4 types( lets call em A,B,C,D).
I need a query that does something like this
|table1 object id | A |value for A|B | value for B| C | value for C|D | vlaue for D|
| 1 | 12| cat | 13| dog | 2 | house | 43| car |
| 1 | 5 | lion | | | | | | |
The column "table1 object id" in real table is multiple columns of data from table 1(for single object its all the same, just repeated on multiple rows because of table 2).
Where 2nd table is in form
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
I hope this is clear enough of the thing i want.
I have tryed using AND and OR and JOIN. This does not seem like something that can be done with crosstab.
EDIT
Table 2
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
|C |wolf | 2 | 6 |
Table 1
| id | value1 | value 2|value 3|
| 1 | hello | test | hmmm |
| 2 | bye | test2 | hmm2 |
Result
|value1| value2| value3| A| value| B |value| C|value | D | value|
|hello | test | hmmm |12| cat | 13| dog |2 | house | 23| car |
|hello | test | hmmm |5 | lion | | | | | | |
|bye | test2 | hmm2 | | | | |6 | wolf | | |
I hope this explains bit bettter of what I want to achieve.