What is the best practice for handling a non-datetime timestamp column within a pandas dataframe? - pyspark

Let's say I have the following pandas dataframe with a non-standard timestamp column that is not in datetime format. For time-series visualization I need to add a new column that converts it into a 24-hourly-based timestamp:
df['timestamp(24hrs)'] = round(df['timestamp(sec)'] / (24 * 3600))
and get this:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Now I noticed that some records' timestamps are missing, and I need to impute the missing data so that:
timestamp(24hrs) is in continuous order
count is filled with 0
Expected output:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|5.0 |U100|0 |
|6.0 |U100|0 |
|7.0 |U100|0 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|12.0 |U100|0 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|15.0 |U100|0 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|268.0 |U100|0 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Any idea how I can do this? Based on this answer for a standard timestamp, I imagine I need to create a new timestamp(24hrs) column spanning from the start to the end of the existing one and then do a left join() and crossJoin(), but I haven't managed it yet.
I've tried the following unsuccessfully:
import pyspark.sql.functions as F
all_dates_df = df.selectExpr(
    "sequence(min(timestamp(24hrs)), max(timestamp(24hrs)), interval 1 hour) as hour"
).select(F.explode("timestamp(24hrs)").alias("timestamp(24hrs)"))
all_dates_df.show()

result_df = all_dates_df.crossJoin(
    df.select("UserName").distinct()
).join(
    df,
    ["count", "timestamp(24hrs)"],
    "left"
).fillna(0)
result_df.show()

The sequence function works on integers, not on the double type, so you need to cast to integer and then cast back to double (if you want to keep the column as a double).
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, DoubleType

# build the full range of hours as integers, then cast back to double
df_seq = (df.withColumn('time_int', F.col('timestamp(24hrs)').cast(IntegerType()))
            .select(F.explode(F.sequence(F.min('time_int'), F.max('time_int'))).alias('timestamp(24hrs)'))
            .select(F.col('timestamp(24hrs)').cast(DoubleType())))

# cross join with the distinct users, left join back to the original data and fill the gaps with 0
df = (df_seq.crossJoin(df.select("User").distinct())
            .join(df, on=['User', 'timestamp(24hrs)'], how='left')
            .fillna(0))
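As a quick sanity check, a minimal reproduction of the above (the toy rows and user name below are made up; only hours 2.0 and 4.0 are missing) could look like this:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, DoubleType

spark = SparkSession.builder.getOrCreate()

# made-up sample with gaps at hours 2.0 and 4.0
df = spark.createDataFrame(
    [(0.0, 'U100', 435), (1.0, 'U100', 1091), (3.0, 'U100', 980), (5.0, 'U100', 288)],
    ['timestamp(24hrs)', 'User', 'count'])

df_seq = (df.withColumn('time_int', F.col('timestamp(24hrs)').cast(IntegerType()))
            .select(F.explode(F.sequence(F.min('time_int'), F.max('time_int'))).alias('timestamp(24hrs)'))
            .select(F.col('timestamp(24hrs)').cast(DoubleType())))

result = (df_seq.crossJoin(df.select('User').distinct())
                .join(df, on=['User', 'timestamp(24hrs)'], how='left')
                .fillna(0)
                .orderBy('timestamp(24hrs)'))

result.show()   # rows for 2.0 and 4.0 now appear with count = 0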

Related

Flatten all map columns recursively in PySpark dataframe

I have a PySpark dataframe with multiple map columns, and I want to flatten all of them recursively. personal and financial are map-type columns; similarly, there might be more map columns.
Input dataframe:
-------------------------------------------------------------------------------------------------------
| id | name | Gender | personal | financial |
-------------------------------------------------------------------------------------------------------
| 1 | A | M | {age:20,city:Dallas,State:Texas} | {salary:10000,bonus:2000,tax:1500}|
| 2 | B | F | {city:Houston,State:Texas,Zipcode:77001} | {salary:12000,tax:1800} |
| 3 | C | M | {age:22,city:San Jose,Zipcode:940088} | {salary:2000,bonus:500} |
-------------------------------------------------------------------------------------------------------
Output dataframe:
--------------------------------------------------------------------------------------------------------------
| id | name | Gender | age | city | state | Zipcode | salary | bonus | tax |
--------------------------------------------------------------------------------------------------------------
| 1 | A | M | 20 | Dallas | Texas | null | 10000 | 2000 | 1500 |
| 2 | B | F | null | Houston | Texas | 77001 | 12000 | null | 1800 |
| 3 | C | M | 22 | San Jose | null | 940088 | 2000 | 500 | null |
--------------------------------------------------------------------------------------------------------------
Use map_concat to merge the map fields and then explode them. Exploding a map column creates two new columns, key and value. Pivot the key column with value as the values to get your desired output.
from pyspark.sql import functions as func

data_sdf. \
    withColumn('personal_financial', func.map_concat('personal', 'financial')). \
    selectExpr(*[c for c in data_sdf.columns if c not in ['personal', 'financial']],
               'explode(personal_financial)'
               ). \
    groupBy([c for c in data_sdf.columns if c not in ['personal', 'financial']]). \
    pivot('key'). \
    agg(func.first('value')). \
    show(truncate=False)
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |id |name|gender|State|Zipcode|age |bonus|city |salary|tax |
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |1 |A |M |Texas|null |20 |2000 |Dallas |10000 |1500|
# |2 |B |F |Texas|77001 |null|null |Houston |12000 |1800|
# |3 |C |M |null |940088 |22 |500 |San Jose|2000 |null|
# +---+----+------+-----+-------+----+-----+--------+------+----+
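If the map columns are not known up front, one way to generalise this (a sketch, assuming a single level of nesting, that all map columns share the same key/value types, and that their keys do not collide; merged_map is just an illustrative name) is to pick them out of the schema:
from pyspark.sql import functions as func
from pyspark.sql.types import MapType

# collect the names of all MapType columns instead of hard-coding them
map_cols = [f.name for f in data_sdf.schema.fields if isinstance(f.dataType, MapType)]
other_cols = [c for c in data_sdf.columns if c not in map_cols]

data_sdf. \
    withColumn('merged_map', func.map_concat(*map_cols)). \
    selectExpr(*other_cols, 'explode(merged_map)'). \
    groupBy(other_cols). \
    pivot('key'). \
    agg(func.first('value')). \
    show(truncate=False)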

pyspark dataframe check if string contains substring

I need help implementing the Python logic below on a PySpark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
from pyspark.sql import functions as F

df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])

df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
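Note that rlike interprets the joined string as a regular expression, so any regex metacharacters in the substrings would change the match. If the substrings should be matched literally (an assumption, not stated in the question), escaping them first is safer:
import re
from pyspark.sql import functions as F

# escape regex metacharacters such as '.', '*', '(' before building the pattern
substrings = [row.sub_string.lower() for row in df2.select('sub_string').collect()]
substr_list = '|'.join(re.escape(s) for s in substrings)

df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)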
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with a simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
   .withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
   .drop('sub_string') \
   .show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+
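One caveat with the join approach: if a main_string contains more than one sub_string, the left join emits one row per match. Collapsing back to one row per main_string (a sketch, reusing the dataframes above) can be done with an aggregation:
from pyspark.sql import functions as f

# count() ignores nulls, so rows with no matching substring end up with isRT = false
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
   .groupBy('main_string') \
   .agg((f.count('sub_string') > 0).alias('isRT')) \
   .show()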

How to keep the maximum value of a column along with other columns in a pyspark dataframe?

Consider that I have this dataframe in pyspark:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-03-2014 07:33| 4.5| 90041 |
| 00236|11-04-2014 05:43| 7.2| 90024 |
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-06-2014 05:55| 5.6| 90037 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
| 00235|11-02-2014 05:33| 4.3| 90029 |
| 00235|11-09-2014 05:33| 4.2| 90047 |
+--------+----------------+---------+---------+
How can I write a PySpark script to keep the row with the maximum value of the range column, along with the other columns, in this PySpark dataframe? The output would be like this:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-05-2014 05:43| 8.5| 90026 |
| 00234|11-01-2014 05:55| 9.2| 90032 |
| 00235|11-05-2014 05:33| 4.3| 90082 |
+--------+----------------+---------+---------+
Using Window and row_number():
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc, col

w = Window.partitionBy("DeviceID")

df.withColumn("rank", row_number().over(w.orderBy(desc("range")))) \
  .filter(col("rank") == 1) \
  .drop("rank").show()
Output:
+--------+----------------+-----+-------+
|DeviceID| TimeStamp|range|zipcode|
+--------+----------------+-----+-------+
| 00236|11-05-2014 05:43| 8.5| 90026|
| 00234|11-01-2014 05:55| 9.2| 90032|
| 00235|11-05-2014 05:33| 4.3| 90082|
+--------+----------------+-----+-------+
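If several rows of a device can tie on the maximum range and all of them should be kept, rank() can replace row_number(), which keeps only one arbitrary row per tie (a sketch):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, desc, col

w = Window.partitionBy("DeviceID").orderBy(desc("range"))

# rank() gives 1 to every row tied for the maximum, so all of them survive the filter
df.withColumn("rnk", rank().over(w)) \
  .filter(col("rnk") == 1) \
  .drop("rnk") \
  .show()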

How can I add one column to other columns in PySpark?

I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've naively tried running a UDF on individual columns, but it takes 30s, 50s, 80s... (the time keeps increasing) per column, so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
    df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this transformation of adding one column to many others?
Using the RDD API, it can be done in this way.
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
  .toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+
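A DataFrame-only alternative (a sketch; it avoids both the RDD round-trip and the growing chain of withColumn calls from the question) builds all the subtractions in a single select:
from pyspark.sql import functions as F

# subtract Average from every other column in one projection
cols = [c for c in df.columns if c != "Average"]
df.select(*[(F.col(c) - F.col("Average")).alias(c) for c in cols],
          F.col("Average")).show()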

Scala group by with mapped keys

I have a DataFrame that has a list of countries and the corresponding data. However, the country codes are either ISO3 or ISO2.
dfJSON
  .select("value.country")
  .filter(size($"value.country") > 0)
  .groupBy($"country")
  .agg(count("*").as("cnt"));
Now this country field can contain either USA or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2 and iso_3 columns, join your actual DataFrame with it, and then apply your logic to that data. Check the sample code below.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
  .join(countryDF, (df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")), "left")
  .select(df("country"), countryDF("country_name"))
  .show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
  .join(countryDF, (df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")), "left")
  .select(df("country"), countryDF("country_name"))
  .groupBy($"country_name")
  .agg(collect_list($"country").as("country_code"), count("*").as("country_count"))
  .show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+