Dividing a set of columns by their averages in PySpark - pyspark

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below are sample data and my current code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
Col1 Col2 Col3 Name
1 40 56 john jones
2 45 55 tracey smith
3 33 23 amy sanders
Expected Output
Col1 Col2 Col3 Name
0.5 1.02 1.25 john jones
1 1.14 1.23 tracey smith
1.5 0.84 0.51 amy sanders
My function as of now (not working):
#function to divide few columns by the column average and overwrite the column
def avg_scaling(df):
    #List of columns which have to be scaled by their average
    col_list = ['col1', 'col2', 'col3']
    for i in col_list:
        df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))
    return df

new_df = avg_scaling(df)

You can make use of a Window here, partitioned on a pseudo column, and compute the average over that window.
The code goes like this:
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 1| 40| 56| john jones|
| 2| 45| 55|tracey smith|
| 3| 33| 23| amy sanders|
+----+----+----+------------+
from pyspark.sql import Window
import pyspark.sql.functions as F

def avg_scaling(df, cols_to_scale):
    # partitioning on a constant literal puts every row in the same window,
    # so F.avg(col).over(w) is the global average of that column
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(col, F.round(F.col(col) / F.avg(col).over(w), 2))
    return df
new_df = avg_scaling(df, ["Col1", 'Col2', 'Col3'])
new_df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 0.5|1.02|1.25| john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+
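If the single-partition window ever becomes a bottleneck (the constant literal forces every row into one partition), an alternative sketch, assuming the same df, is to compute all the averages once with agg and divide by the collected values:
import pyspark.sql.functions as F

def avg_scaling_via_agg(df, cols_to_scale):
    # one aggregation returns a single small row holding every column average
    avgs = df.agg(*[F.avg(c).alias(c) for c in cols_to_scale]).first()
    for c in cols_to_scale:
        df = df.withColumn(c, F.round(F.col(c) / F.lit(avgs[c]), 2))
    return df

new_df = avg_scaling_via_agg(df, ["Col1", "Col2", "Col3"])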

Related

How to perform conditional join with time column in spark scala

I am looking for help joining two DataFrames with a conditional join on time columns, using Spark Scala.
DF1
time_1               revision  review_1
2022-04-05 08:32:00  1         abc
2022-04-05 10:15:00  2         abc
2022-04-05 12:15:00  3         abc
2022-04-05 09:00:00  1         xyz
2022-04-05 20:20:00  2         xyz
DF2:
time_2               review_2  value
2022-04-05 08:30:00  abc       value_1
2022-04-05 09:48:00  abc       value_2
2022-04-05 15:40:00  abc       value_3
2022-04-05 08:00:00  xyz       value_4
2022-04-05 09:00:00  xyz       value_5
2022-04-05 10:00:00  xyz       value_6
2022-04-05 11:00:00  xyz       value_7
2022-04-05 12:00:00  xyz       value_8
Desired Output DF:
time_1               revision  review_1  value
2022-04-05 08:32:00  1         abc       value_1
2022-04-05 10:15:00  2         abc       value_2
2022-04-05 12:15:00  3         abc       null
2022-04-05 09:00:00  1         xyz       value_6
2022-04-05 20:20:00  2         xyz       null
As in row 4 of the final output (where time_1 = 2022-04-05 09:00:00): if multiple values match during the join, only the latest in time should be taken.
Furthermore, if there is no match for a row of DF1 in the join, the value column should be null.
Here we need to join on two columns across the two DataFrames:
review_1 === review_2 &&
time_1 === time_2 (condition: time_1 should be within +1/-1 hour of time_2; if multiple records match, show the latest value, as with value_6 above)
Here is the code necessary to join the DataFrames:
I have commented the code so as to explain the logic.
TL;DR
import org.apache.spark.sql.expressions.Window

val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)

df1WithEpoch
  .join(df2WithEpoch,
    df1WithEpoch("review_1") === df2WithEpoch("review_2")
    && (
      // if `time_1` is in between `time_2` and `time_2` - 1 hour
      (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
      // if `time_1` is in between `time_2` and `time_2` + 1 hour
      && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
    ),
    // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
    "left_outer"
  )
  .withColumn("row_num", row_number().over(window))
  .filter(col("row_num") === 1)
  // select only the columns we care about
  .select("time_1", "revision", "review_1", "value")
  // order by to give the results in the same order as in the Question
  .orderBy(col("review_1"), col("revision"))
  .show(false)
Full breakdown
Let's start off with your DataFrames: df1 and df2 in code:
val df1 = List(
  ("2022-04-05 08:32:00", 1, "abc"),
  ("2022-04-05 10:15:00", 2, "abc"),
  ("2022-04-05 12:15:00", 3, "abc"),
  ("2022-04-05 09:00:00", 1, "xyz"),
  ("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")

df1.show(false)
gives:
+-------------------+--------+--------+
|time_1 |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1 |abc |
|2022-04-05 10:15:00|2 |abc |
|2022-04-05 12:15:00|3 |abc |
|2022-04-05 09:00:00|1 |xyz |
|2022-04-05 20:20:00|2 |xyz |
+-------------------+--------+--------+
val df2 = List(
  ("2022-04-05 08:30:00", "abc", "value_1"),
  ("2022-04-05 09:48:00", "abc", "value_2"),
  ("2022-04-05 15:40:00", "abc", "value_3"),
  ("2022-04-05 08:00:00", "xyz", "value_4"),
  ("2022-04-05 09:00:00", "xyz", "value_5"),
  ("2022-04-05 10:00:00", "xyz", "value_6"),
  ("2022-04-05 11:00:00", "xyz", "value_7"),
  ("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")

df2.show(false)
gives:
+-------------------+--------+-------+
|time_2 |review_2|value |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc |value_1|
|2022-04-05 09:48:00|abc |value_2|
|2022-04-05 15:40:00|abc |value_3|
|2022-04-05 08:00:00|xyz |value_4|
|2022-04-05 09:00:00|xyz |value_5|
|2022-04-05 10:00:00|xyz |value_6|
|2022-04-05 11:00:00|xyz |value_7|
|2022-04-05 12:00:00|xyz |value_8|
+-------------------+--------+-------+
Next we need new columns on which we can do the date-range check (with time represented as a single number, making the math operations easy):
// add a new column, temporarily, which contains the time in
// epoch format: with this adding/subtracting an hour can easily be done.
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))
df1WithEpoch.show()
df2WithEpoch.show()
gives:
+-------------------+--------+--------+------------+
| time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00| 1| abc| 1649147520|
|2022-04-05 10:15:00| 2| abc| 1649153700|
|2022-04-05 12:15:00| 3| abc| 1649160900|
|2022-04-05 09:00:00| 1| xyz| 1649149200|
|2022-04-05 20:20:00| 2| xyz| 1649190000|
+-------------------+--------+--------+------------+
+-------------------+--------+-------+------------+
| time_2|review_2| value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00| abc|value_1| 1649147400|
|2022-04-05 09:48:00| abc|value_2| 1649152080|
|2022-04-05 15:40:00| abc|value_3| 1649173200|
|2022-04-05 08:00:00| xyz|value_4| 1649145600|
|2022-04-05 09:00:00| xyz|value_5| 1649149200|
|2022-04-05 10:00:00| xyz|value_6| 1649152800|
|2022-04-05 11:00:00| xyz|value_7| 1649156400|
|2022-04-05 12:00:00| xyz|value_8| 1649160000|
+-------------------+--------+-------+------------+
and finally to join:
import org.apache.spark.sql.expressions.Window

val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)

df1WithEpoch
  .join(df2WithEpoch,
    df1WithEpoch("review_1") === df2WithEpoch("review_2")
    && (
      // if `time_1` is in between `time_2` and `time_2` - 1 hour
      (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
      // if `time_1` is in between `time_2` and `time_2` + 1 hour
      && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
    ),
    // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
    "left_outer"
  )
  .withColumn("row_num", row_number().over(window))
  .filter(col("row_num") === 1)
  // select only the columns we care about
  .select("time_1", "revision", "review_1", "value")
  // order by to give the results in the same order as in the Question
  .orderBy(col("review_1"), col("revision"))
  .show(false)
gives:
+-------------------+--------+--------+-------+
|time_1 |revision|review_1|value |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1 |abc |value_1|
|2022-04-05 10:15:00|2 |abc |value_2|
|2022-04-05 12:15:00|3 |abc |null |
|2022-04-05 09:00:00|1 |xyz |value_6|
|2022-04-05 20:20:00|2 |xyz |null |
+-------------------+--------+--------+-------+
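For readers who want the same approach in PySpark rather than Scala, here is a rough equivalent sketch (assuming df1 and df2 carry the columns shown above, with df2's review column named review_2; the window below also partitions on review_1 so identical timestamps in different review groups cannot collide):
from pyspark.sql import Window
import pyspark.sql.functions as F

SECONDS_IN_ONE_HOUR = 60 * 60

df1e = df1.withColumn("epoch_time_1", F.unix_timestamp("time_1"))
df2e = df2.withColumn("epoch_time_2", F.unix_timestamp("time_2"))

w = Window.partitionBy("review_1", "time_1").orderBy(F.col("time_2").desc())

result = (df1e.join(df2e,
                    (df1e["review_1"] == df2e["review_2"])
                    & (df1e["epoch_time_1"] >= df2e["epoch_time_2"] - SECONDS_IN_ONE_HOUR)
                    & (df1e["epoch_time_1"] <= df2e["epoch_time_2"] + SECONDS_IN_ONE_HOUR),
                    "left_outer")
              # keep only the latest matching time_2 per df1 row
              .withColumn("row_num", F.row_number().over(w))
              .filter(F.col("row_num") == 1)
              .select("time_1", "revision", "review_1", "value"))
result.show(truncate=False)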

Creating a new column using info from another df

I'm trying to create a new column based on information from another dataframe.
df1
Loc Time Wage
1 192 1
3 192 2
1 193 3
5 193 3
7 193 5
2 194 7
df2
Loc City
1 NYC
2 Miami
3 LA
4 Chicago
5 Houston
6 SF
7 DC
desired output:
Loc Time Wage City
1 192 1 NYC
3 192 2 LA
1 193 3 NYC
5 193 3 Houston
7 193 5 DC
2 194 7 Miami
The actual dataframes vary quite a lot in terms of row counts, but it's something along those lines. I think this might be achievable through .map, but I haven't found much documentation for that online. join doesn't really seem to fit this situation.
join is exactly what you need. Try running this in the spark-shell
import spark.implicits._
val col1 = Seq("loc", "time", "wage")
val data1 = Seq((1, 192, 1), (3, 192, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
val col2 = Seq("loc", "city")
val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)
val outputDf = df1.join(df2, Seq("loc")) // join on the column "loc"
outputDf.show()
This will output
+---+----+----+-------+
|loc|time|wage| city|
+---+----+----+-------+
| 1| 192| 1| NYC|
| 1| 193| 3| NYC|
| 2| 194| 7| Miami|
| 3| 192| 2| LA|
| 5| 193| 3|Houston|
| 7| 193| 5| DC|
+---+----+----+-------+
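If df2 is a small lookup table (loc -> city), a broadcast join hint avoids shuffling the larger dataframe. A PySpark sketch of the same join, assuming equivalent dataframes df1 and df2:
from pyspark.sql.functions import broadcast

# broadcast the small lookup table so the large df1 is not shuffled
output_df = df1.join(broadcast(df2), on="loc")
output_df.show()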

New column creation based on if and else condition using pyspark

I have two Spark dataframes, and I want to add a new column named "seg" to dataframe df2 based on the condition below:
if the df2.colx value is present in df1.colx.
I tried the below operation in pyspark but it's throwing an exception.
cc002 = df2.withColumn('seg',F.when(df2.colx == df1.colx,"True").otherwise("FALSE"))
df1 :
id colx coly
1 678 56789
2 900 67890
3 789 67854
df2
Name colx
seema 900
yash 678
deep 800
harsh 900
My expected Output is
Name colx seg
seema 900 True
harsh 900 True
yash 678 True
deep 800 False
Please help me correct the given pyspark code or suggest a better way of doing it.
If I understand your question correctly, what you want to do is this:
res = df2.join(
    df1,
    on="colx",
    how="left"
).select(
    "Name",
    "colx"
).withColumn(
    "seg",
    F.when(F.col(colx).isNull(), F.lit(True)).otherwise(F.lit(False))
)
Let me know if this is the solution you want.
My bad, I wrote the incorrect code in a hurry; below is the corrected one:
import pyspark.sql.functions as F

df1 = sqlContext.createDataFrame([[1, 678, 56789], [2, 900, 67890], [3, 789, 67854]], ['id', 'colx', 'coly'])
df2 = sqlContext.createDataFrame([["seema", 900], ["yash", 678], ["deep", 800], ["harsh", 900]], ['Name', 'colx'])

res = df2.join(
    # add a constant marker column so matched rows can be told apart from unmatched ones
    df1.withColumn("check", F.lit(1)),
    on="colx",
    how="left"
).withColumn(
    "seg",
    F.when(F.col("check").isNotNull(), F.lit(True)).otherwise(F.lit(False))
).select(
    "Name",
    "colx",
    "seg"
)
res.show()
+-----+----+-----+
| Name|colx| seg|
+-----+----+-----+
| yash| 678| true|
|seema| 900| true|
|harsh| 900| true|
| deep| 800|false|
+-----+----+-----+
You can join on colx and fill null values with False:
result = (df2.join(df1.select(df1['colx'], F.lit(True).alias('seg')),
                   on='colx',
                   how='left')
          .fillna(False, subset='seg'))
result.show()
Output:
+----+-----+-----+
|colx| Name| seg|
+----+-----+-----+
| 900|seema| true|
| 900|harsh| true|
| 800| deep|false|
| 678| yash| true|
+----+-----+-----+
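One caveat for both join-based answers: if df1.colx ever contains duplicate values, the left join will duplicate the matching df2 rows. A sketch of the deduplicated variant (reusing F from above), under that assumption:
result = (df2.join(df1.select('colx').distinct().withColumn('seg', F.lit(True)),
                   on='colx',
                   how='left')
          .fillna(False, subset='seg'))
result.show()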

Get the number of null per row in PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1 col2 col3
null 1 a
1 2 b
2 3 null
Should in the end be:
col1 col2 col3 number_of_null
null 1 a 1
1 2 b 0
2 3 null 1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1 col2 col3 number_of_ABC
ABC 1 a 1
1 2 b 0
2 ABC ABC 2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want to have a pandas solution.
EDIT3: The solution explained with sums or means does not work as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
val df = List(
  ("ABC", "1", "a"),
  ("1", "2", "b"),
  ("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")

val expected = "ABC"
val complexColumn: Column = df.schema.fieldNames.map(c => when(col(c) === lit(expected), 1).otherwise(0)).reduce((a, b) => a + b)
df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
import pyspark.sql.functions as func

dfvals = [
    (None, "1", "a"),
    ("1", "2", "b"),
    ("2", None, None)
]
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])

new_df = df.withColumn('null_cnt',
                       reduce(lambda x, y: x + y,
                              map(lambda x: func.when(func.isnull(func.col(x)) == 'true', 1).otherwise(0),
                                  df.schema.names)))
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+
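The same reduce-over-columns pattern answers the generalized question (counting how often a given value, e.g. "ABC", appears per row). A PySpark sketch, assuming string-typed columns in df:
import pyspark.sql.functions as F
from functools import reduce  # built in on Python 2, an explicit import on Python 3

expected = "ABC"
# build one column per field that is 1 when it equals the expected value, then sum them
count_col = reduce(lambda a, b: a + b,
                   [F.when(F.col(c) == expected, 1).otherwise(0) for c in df.columns])
df.withColumn("number_of_ABC", count_col).show()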

Comparing DataFrames in Spark

I have 2 dataframes
df1
+----------+----------------+--------------------+--------------+-------------+
| WEEK|DIM1 |DIM2 |T1 | T2 |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02| 14| NULL| 9874| 880 |
|2016-04-30| 14|FR | 9875| 13 |
|2017-06-10| 15| PQR| 9867| 57721 |
+----------+----------------+--------------------+--------------+-------------+
df2
+----------+----------------+--------------------+--------------+-------------+
| WEEK|DIM1 |DIM2 |T1 | T2 |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02| 14| NULL| 9879| 820 |
|2016-04-30| 14|FR | 9785| 9 |
|2017-06-10| 15| XYZ| 9967| 57771 |
+----------+----------------+--------------------+--------------+-------------+
I want to write a comparator in Spark which compares T1 and T2 in both dataframes, keyed on WEEK, DIM1, DIM2, where T1 and T2 in df1 should be greater than T1 and T2 in df2 by 3. I want to return all rows which do not match the above criterion, with the difference between T1 and T2 across the dataframes. I also want to have rows present in df1 but not in df2, and vice versa, for the combination WEEK, DIM1, DIM2.
The output should be like this
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
| WEEK|DIM1 |DIM2 |T1_dIFF | T2_dIFF | Presenent_In_DF1 | Presenent_In_DF2|
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
|2016-04-30| 14|FR | 90| 4 | Y | Y |
|2017-06-10| 15|PQR | 9867| 57721 | Y | N |
|2017-06-10| 15|XYZ | 9967| 57771 | N | Y |
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
What is the best way to go around this ?
I have implemented the following but do not know how to proceed after this -
val df1 = Seq(
  ("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")

val df2 = Seq(
  ("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")
import org.apache.spark.sql.functions._
val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
The joined look like this -
+----------+----+----+----+-----+----+-----+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|
+----------+----+----+----+-----+----+-----+
|2016-04-02| 14|NULL|9874| 880|9879| 820|
|2017-06-10| 15| PQR|9867|57721|null| null|
|2017-06-10| 15| XYZ|null| null|9967|57771|
|2016-04-30| 14| FR|9875| 13|9785| 9|
+----------+----+----+----+-----+----+-----+
I do not know how to proceed after this in a good way; I'm relatively new to Scala.
One easy solution could be to join df1 and df2 with WEEK as the unique key. In the joined data you need to keep all the columns from df1 and df2.
Then you can do a map operation on the dataframe to produce the rest of the columns.
Something like
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
val df = spark.sql("select df1.*, df2.DIM1 as df2_DIM1, df2.DIM2 as df2_DIM2, df2.T1 as df2_T1, df2.T2 as df2_T2 from df1 join df2 on df1.WEEK = df2.WEEK")
// Now map on the dataframe to produce the diff dataframe
// Or you can use the SQL to do that.
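To sketch how that remaining step could look, here is a rough PySpark version starting from a full outer join on WEEK, DIM1, DIM2 as in the question (the same select/when logic translates directly to Scala; the final threshold filter is omitted because the exact "greater by 3" rule depends on the intended comparison):
import pyspark.sql.functions as F

joined = df1.alias("l").join(df2.alias("r"), ["WEEK", "DIM1", "DIM2"], "fullouter")

result = (joined
          # a row is present in a dataframe when its side of the join is not null
          .withColumn("Present_In_DF1", F.when(F.col("l.T1").isNotNull(), "Y").otherwise("N"))
          .withColumn("Present_In_DF2", F.when(F.col("r.T1").isNotNull(), "Y").otherwise("N"))
          # absolute differences; a missing side is treated as 0 so the other side's value shows through
          .withColumn("T1_diff", F.abs(F.coalesce(F.col("l.T1"), F.lit(0)) - F.coalesce(F.col("r.T1"), F.lit(0))))
          .withColumn("T2_diff", F.abs(F.coalesce(F.col("l.T2"), F.lit(0)) - F.coalesce(F.col("r.T2"), F.lit(0))))
          .select("WEEK", "DIM1", "DIM2", "T1_diff", "T2_diff", "Present_In_DF1", "Present_In_DF2"))
result.show(truncate=False)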