How to keep the maximum value of a column along with other columns in a pyspark dataframe?

Consider that I have this dataframe in pyspark:
+--------+----------------+-----+-------+
|DeviceID|       TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
|   00236|11-03-2014 07:33|  4.5|  90041|
|   00236|11-04-2014 05:43|  7.2|  90024|
|   00236|11-05-2014 05:43|  8.5|  90026|
|   00234|11-06-2014 05:55|  5.6|  90037|
|   00234|11-01-2014 05:55|  9.2|  90032|
|   00235|11-05-2014 05:33|  4.3|  90082|
|   00235|11-02-2014 05:33|  4.3|  90029|
|   00235|11-09-2014 05:33|  4.2|  90047|
+--------+----------------+-----+-------+
How can I write a pyspark script that, for each DeviceID, keeps the row with the maximum value of the range column along with the other columns? The expected output looks like this:
+--------+----------------+-----+-------+
|DeviceID|       TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
|   00236|11-05-2014 05:43|  8.5|  90026|
|   00234|11-01-2014 05:55|  9.2|  90032|
|   00235|11-05-2014 05:33|  4.3|  90082|
+--------+----------------+-----+-------+

Using Window and row_number():
from pyspark.sql.functions import col, desc, row_number
from pyspark.sql.window import Window

# Rank rows within each DeviceID by descending range and keep the top row
w = Window.partitionBy("DeviceID").orderBy(desc("range"))

df.withColumn("rank", row_number().over(w)) \
  .filter(col("rank") == 1) \
  .drop("rank") \
  .show()
Output:
+--------+----------------+-----+-------+
|DeviceID| TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
| 00236|11-05-2014 05:43| 8.5| 90026|
| 00234|11-01-2014 05:55| 9.2| 90032|
| 00235|11-05-2014 05:33| 4.3| 90082|
+--------+----------------+-----+-------+
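An alternative sketch, for comparison: aggregate the per-device maximum of range and join it back to recover the other columns (note that ties on the maximum would keep duplicate rows, unlike row_number()).
from pyspark.sql import functions as F

# Maximum range per DeviceID, then an inner join to pull back TimeStamp and zipcode
max_df = df.groupBy("DeviceID").agg(F.max("range").alias("range"))
df.join(max_df, on=["DeviceID", "range"], how="inner").show()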

Related

What is the best practice to handle a non-datetime timestamp column within a pandas dataframe?

Let's say I have the following dataframe with a non-standard timestamp column that is not in datetime format. I need to add a new column that converts it into a 24-hourly timestamp for time-series visualization:
df['timestamp(24hrs)'] = round(df['timestamp(sec)']/(24*3600))
and get this:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Now I noticed that some records' timestamps are missing, and I need to impute the missing rows so that:
timestamp(24hrs) runs in continuous order
the missing count values are filled with 0
Expected output:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|5.0 |U100|0 |
|6.0 |U100|0 |
|7.0 |U100|0 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|12.0 |U100|0 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|15.0 |U100|0 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|268.0 |U100|0 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Any idea how I can do this? Based on this answer for standard timestamps, I imagine I need to build a new timestamp(24hrs) column spanning the start and end of the existing one and then do a left join() and crossJoin(), but I couldn't manage it yet.
I've tried the following unsuccessfully:
import pyspark.sql.functions as F

all_dates_df = df.selectExpr(
    "sequence(min(timestamp(24hrs)), max(timestamp(24hrs)), interval 1 hour) as hour"
).select(F.explode("timestamp(24hrs)").alias("timestamp(24hrs)"))
all_dates_df.show()

result_df = all_dates_df.crossJoin(
    df.select("UserName").distinct()
).join(
    df,
    ["count", "timestamp(24hrs)"],
    "left"
).fillna(0)
result_df.show()
The sequence function works on integer types, not doubles, so you have to cast the column to integer first and then cast back to double (if you want to keep it as a double).
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, IntegerType

# Build the full range of hours as integers, then cast back to double
df_seq = (df.withColumn('time_int', F.col('timestamp(24hrs)').cast(IntegerType()))
          .select(F.explode(F.sequence(F.min('time_int'), F.max('time_int'))).alias('timestamp(24hrs)'))
          .select(F.col('timestamp(24hrs)').cast(DoubleType())))

# Cross join with the distinct users, left join the original data, fill the gaps with 0
df = (df_seq.crossJoin(df.select("User").distinct())
      .join(df, on=['User', 'timestamp(24hrs)'], how='left')
      .fillna(0))
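Note that fillna(0) replaces nulls in every numeric column left over from the left join, which is what fills the count of the newly generated timestamps with 0.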

How to create a new column based on whether certain strings exist in another column?

I have a table that looks like this:
+--------+-------------+
| Time | Locations |
+--------+-------------+
| 1/1/22 | A300-abc |
+--------+-------------+
| 1/2/22 | A300-FFF |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc |
+--------+-------------+
| 1/5/22 | B750-EEE |
+--------+-------------+
| 1/6/22 | M-200-68 |
+--------+-------------+
| 1/7/22 | ABC-abc |
+--------+-------------+
I would like to derive a table that looks like this:
+--------+-------------+-----------------+
| Time | Locations | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc | A300 |
+--------+-------------+-----------------+
| 1/2/22 | A300-FFF | A300 |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300 |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc | B700 |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE | B750 |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68 | M-200 |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc | "not_listed" |
+--------+-------------+-----------------+
Essentially I have a list of what the location code should be e.g. ["A300","B700","B750","M-200"], but currently the location column is very messy with other random strings. I want to create a new column that shows the "cleaned" version of the location code, and anything that is not in that list should be marked as "not_listed".
Use regex and a when condition. In this case I check whether the string begins with a digit (^[0-9]) and, if so, extract the leading digits from the string. If it doesn't, mark it as not_listed. Code below:
from pyspark.sql.functions import col, lit, regexp_extract, when

df = df.withColumn('Locations_Clean', when(col("Locations").rlike("^[0-9]"), regexp_extract('Locations', '^[0-9]+', 0)).otherwise(lit('not_listed')))
df.show()
+--------------------+---------+---------------+
| Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456| 300abc| 300|
|0.022727272727272728| 300FFF| 300|
| 0.01515151515151515| 300ABC| 300|
|0.011363636363636364| 700abc| 700|
|0.009090909090909092| 750EEE| 750|
|0.007575757575757575| ABCabc| not_listed|
+--------------------+---------+---------------+
With your updated question, use regexp_replace:
from pyspark.sql.functions import col, lit, regexp_replace, when

df = df.withColumn('Locations_Clean', when(col("Locations").rlike(r"\d"), regexp_replace('Locations', r'-\w+$', '')).otherwise(lit('not_listed')))
df.show()
+------+-----------+---------------+
| Time| Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22| A300-abc| A300|
|1/2/22| A300-FFF| A300|
|1/3/22|A300-ABC123| A300|
|1/4/22| B700-abc| B700|
|1/5/22| B750-EEE| B750|
|1/6/22|   M-200-68|          M-200|
|1/7/22|    ABC-abc|     not_listed|
+------+-----------+---------------+
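Alternatively, since the list of valid codes is known up front, here is a minimal sketch that matches against that list directly (the allowed variable below is an assumption taken from the question, not part of the original answer):
from pyspark.sql import functions as F

allowed = ["A300", "B700", "B750", "M-200"]
pattern = "^(" + "|".join(allowed) + ")"  # anchor the known codes to the start of the string

df = df.withColumn(
    'Locations_Clean',
    F.when(F.col("Locations").rlike(pattern), F.regexp_extract("Locations", pattern, 1))
     .otherwise(F.lit("not_listed"))
)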

Mean across different columns ignoring null values, Spark Scala

I have a dataframe with several columns, and I am trying to compute the mean across these columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But the null values propagate into the result:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
import org.apache.spark.sql.functions._

val result = df.join(
  df.select(
    col("Baller"),
    explode(array(col("Power"), col("Vision"), col("KXD")))
  ).groupBy("Baller").agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
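This works because Spark's mean aggregate ignores nulls, so John's row averages only Power and KXD and yields 7.5 instead of null.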

How to combine pyspark dataframes with different shapes and different columns

I have two dataframes in Pyspark. One has more than 1000 rows and the other only 4 rows. The columns do not match either.
df1 with more than 1000 rows:
+----+--------+--------------+-------------+
| ID | col1 | col2 | col 3 |
+----+--------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 |
| 2 | time 2 | value2_col2 | value2_col3 |
+----+--------+--------------+-------------+
...
df2 with only 4 rows:
+-----+--------------+--------------+
| key | col_c | col_d |
+-----+--------------+--------------+
| a | valuea_colc | valuea_cold |
| b | valueb_colc | valueb_cold |
+-----+--------------+--------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| ID | col1 | col2 | col 3 | a_col_c | a_col_d | b_col_c | b_col_d |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2 | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!
I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.
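A minimal sketch of that idea, assuming the column names from the question (key, col_c, col_d); the pivot step is my addition to get one wide row per key before the cross join:
from pyspark.sql import functions as F

# Pivot df2 into a single row with columns like a_col_c, a_col_d, b_col_c, b_col_d
wide_df2 = df2.groupBy().pivot("key").agg(
    F.first("col_c").alias("col_c"),
    F.first("col_d").alias("col_d"),
)

# Cross join the single wide row onto every row of df1
result = df1.crossJoin(wide_df2)
result.show()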

Missing Spark partition column in partitioned table

I am creating a partitioned parquet file in HDFS with a datasource.
The datasource looks like:
scala> sqlContext.sql("select * from parquetFile").show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| 0|LSKG5GC19BA210794|
| 0|LSKG5GC15BA210372|
| 0|LSKG5GC18BA210107|
| 0|LSKG4GC16BA211971|
| 0|LSKG4GC19BA210233|
| 0|LSKG5GC17BA210017|
| 0|LSKG4GC19BA211785|
| 0|LSKG4GC15BA210004|
| 0|LSKG4GC12BA211739|
| 0|LSKG4GC18BA210238|
| 0|LSKG4GC13BA210261|
| 0|LSKG5GC16BA210106|
| 0|LSKG4GC1XBA210287|
| 0|LSKG4GC10BA210265|
| 0|LSKG5GC10CA210118|
| 0|LSKG5GC16BA212289|
| 0|LSKG5GC1XBA211016|
| 0|LSKG5GC15CA210194|
| 0|LSKG5GC12CA210119|
| 0|LSKG4GC19BA211379|
+--------+-----------------+
I create the partition with the following commands (run in the spark shell):
scala>val df1 = sqlContext.sql("select * from parquetFile where area_tag=0 ")
scala>df1.write.parquet("/tmp/test_table3/area_tag=0")
scala>val p1 = sqlContext.read.parquet("/tmp/test_table3")
When I print the data by loading from the partitioned table, it shows:
scala> p1.show()
+--------+-----------------+
|area_tag| vin|
+--------+-----------------+
| |LSKG5GC19BA210794|
| |LSKG5GC15BA210372|
| |LSKG5GC18BA210107|
| |LSKG4GC16BA211971|
| |LSKG4GC19BA210233|
| |LSKG5GC17BA210017|
| |LSKG4GC19BA211785|
| |LSKG4GC15BA210004|
| |LSKG4GC12BA211739|
| |LSKG4GC18BA210238|
| |LSKG4GC13BA210261|
| |LSKG5GC16BA210106|
| |LSKG4GC1XBA210287|
| |LSKG4GC10BA210265|
| |LSKG5GC10CA210118|
| |LSKG5GC16BA212289|
| |LSKG5GC1XBA211016|
| |LSKG5GC15CA210194|
| |LSKG5GC12CA210119|
| |LSKG4GC19BA211379|
+--------+-----------------+
only showing top 20 rows
The partition column values are missing. What happened to the column? Is it a bug?
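For reference, a minimal PySpark sketch of the built-in partitionBy write that Spark's partition discovery is designed around (path and column name taken from the question; this only illustrates the conventional layout and is not presented as a confirmed fix for the blank column):
# Write one subdirectory per area_tag value; partition discovery derives the column from the paths
df1.write.mode("overwrite").partitionBy("area_tag").parquet("/tmp/test_table3")

p1 = sqlContext.read.parquet("/tmp/test_table3")
p1.show()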