Six months of records for every account number in pyspark

I have tried the rank function, but it ranks by row, so the results go beyond 180 days.
This is the result I am getting, but it is not what I want; it is wrong because it includes transactions beyond 180 days.
from pyspark.sql import Window
from pyspark.sql.functions import col, rank

window = Window.partitionBy(df3['acctno']).orderBy(df3['trans_date'])
df3.select('*', rank().over(window).alias('rank')) \
    .filter(col('rank') <= 180) \
    .show(500)
|year|month|day|date| txnrefid| acc|branch|channel|rank|
+----+-----+---+-----------+-----------+--------------+----------+-----------+----+
|2020| 2| 6| 2020-02-06| 1234abcd6| 2074-556-1111| 6666| CBS| 1|
|2020| 2| 7| 2020-02-07| 1234abcd7| 2074-556-1111| 6666| CBS| 2|
|2020| 2| 8| 2020-02-08| 1234abcd8| 2074-556-1111| 6666| CBS| 3|
|2020| 2| 9| 2020-02-09| 1234abcd9| 2074-556-1111| 6666| CBS| 4|
But I want it like this:
|year|month|day|date| txnrefid| acc|branch|channel|rank|
|2020| 2| 6| 2020-02-06| 1234abcd6| 2074-556-1111| 6666| CBS| 1|
|2020| 2| 7| 2020-02-07| 1234abcd7| 2074-556-1111| 6666| CBS| 2|
|2020| 2| 8| 2020-02-08| 1234abcd8| 2074-556-1111| 6666| CBS| 3|
|2020| 2| 9| 2020-02-09| 1234abcd9| 2074-556-1111| 6666|

As you edited your question, here is a new answer that uses a different approach.
The idea is, for each account number, to get the minimum date, compute the limit date (min date + 180 days), and then drop all the rows that fall after it.
from pyspark.sql import functions as F, Window

df.count()  # I used your sample data, so 60 lines
> 60
w = Window.partitionBy(df["acctno"])
df = df.withColumn("min_date", F.min(F.col("trans_date").cast("date")).over(w))
df = df.where(
    F.col("trans_date").cast("date")
    <= F.date_add(  # Use F.date_add to add days or F.add_months to add months.
        F.col("min_date"), 180
    )
).drop("min_date")
df.count()  # Final dataframe limited to 180 days, nothing later than 2020-08-04
> 54
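Since the question title asks for six months rather than 180 days, here is a hedged variant of the same approach (assuming the same acctno and trans_date columns), swapping F.date_add for F.add_months:
from pyspark.sql import functions as F, Window

w = Window.partitionBy(df["acctno"])
df = df.withColumn("min_date", F.min(F.col("trans_date").cast("date")).over(w))
# Keep only rows within 6 calendar months of each account's first transaction.
df = df.where(
    F.col("trans_date").cast("date") <= F.add_months(F.col("min_date"), 6)
).drop("min_date")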

If you want the first 6 months, you should use the fields "year" and "month", not "trans_date".
Something like window = Window.partitionBy(df3['acctno']).orderBy(df3['year'], df3['month']) should give you better results.
Then you filter on rank <= 6:
df3.select("*", dense_rank().over(window).alias("rank")).filter(col("rank") <= 6).show(500)
EDIT: You need to use dense_rank here: rank leaves gaps after ties, so filtering on rank <= 6 would not cover six distinct months, whereas dense_rank numbers the distinct (year, month) pairs consecutively.

Related

PySpark convert Dataframe to Dictionary

I got the following DataFrame:
>>> df.show(50)
+--------------------+-------------+----------------+----+
| User Hash ID| Word|sum(Total Count)|rank|
+--------------------+-------------+----------------+----+
|00095808cdc611fb5...| errors| 5| 1|
|00095808cdc611fb5...| text| 3| 2|
|00095808cdc611fb5...| information| 3| 3|
|00095808cdc611fb5...| department| 2| 4|
|00095808cdc611fb5...| error| 2| 5|
|00095808cdc611fb5...| data| 2| 6|
|00095808cdc611fb5...| web| 2| 7|
|00095808cdc611fb5...| list| 2| 8|
|00095808cdc611fb5...| recognition| 2| 9|
|00095808cdc611fb5...| pipeline| 2| 10|
|000ac87bf9c1623ee...|consciousness| 14| 1|
|000ac87bf9c1623ee...| book| 3| 2|
|000ac87bf9c1623ee...| place| 2| 3|
|000ac87bf9c1623ee...| mystery| 2| 4|
|000ac87bf9c1623ee...| mental| 2| 5|
|000ac87bf9c1623ee...| flanagan| 2| 6|
|000ac87bf9c1623ee...| account| 2| 7|
|000ac87bf9c1623ee...| world| 2| 8|
|000ac87bf9c1623ee...| problem| 2| 9|
|000ac87bf9c1623ee...| theory| 2| 10|
This shows, for each user, the 10 most frequent words they read.
I would like to create a dictionary, which then can be saved to a file, with the following format:
User : <top 1 word>, <top 2 word> .... <top 10 word>
To achieve this, I thought it might be more efficient to cut down the df as much as possible before converting it. Thus, I tried:
>>> df.groupBy("User Hash ID").agg(collect_list("Word")).show(20)
+--------------------+--------------------+
| User Hash ID| collect_list(Word)|
+--------------------+--------------------+
|00095808cdc611fb5...|[errors, text, in...|
|000ac87bf9c1623ee...|[consciousness, b...|
|0038ccf6e16121e7c...|[potentials, orga...|
|0042bfbafc6646f47...|[fuel, car, consu...|
|00a19396b7bb52e40...|[face, recognitio...|
|00cec95a2c007b650...|[force, energy, m...|
|00df9406cbab4575e...|[food, history, w...|
|00e6e2c361f477e1c...|[image, based, al...|
|01636d715de360576...|[functional, lang...|
|01a778c390e44a8c3...|[trna, genes, pro...|
|01ab9ade07743d66b...|[packaging, car, ...|
|01bdceea066ec01c6...|[anthropology, de...|
|020c643162f2d581b...|[laser, electron,...|
|0211604d339d0b3db...|[food, school, ve...|
|0211e8f09720c7f47...|[privacy, securit...|
|021435b2c4523dd31...|[life, rna, origi...|
|0239620aa740f1514...|[method, image, d...|
|023ad5d85a948edfc...|[web, user, servi...|
|02416836b01461574...|[parts, based, ad...|
|0290152add79ae1d8...|[data, score, de,...|
+--------------------+--------------------+
From here, it should be more straightforward to generate that dictionary. However, I cannot be sure that by using this agg function I am guaranteed the words are in the correct order. That is why I am hesitant and wanted to get some feedback on possibly better options.
Based on the answers provided here: collect_list by preserving order based on another variable,
you can write the query below to make sure you have the top 5 in the correct order:
import pyspark.sql.functions as F

grouped_df = (
    dft.groupby("userid")
    .agg(F.sort_array(F.collect_list(F.struct("rank", "word"))).alias("collected_list"))
    .withColumn("sorted_list", F.slice(F.col("collected_list.word"), start=1, length=5))
    .drop("collected_list")
)
grouped_df.show(truncate=False)
First of all, if you go from a dataframe to a dictionary, you may run into memory issues, as you will bring the entire content of the dataframe to your driver (a dictionary is a Python object, not a Spark object).
You are not that far away from a working solution. I'd do it this way:
from pyspark.sql import functions as F
df.groupBy("User Hash ID").agg(
F.collect_list(F.struct("Word", "sum(Total Count)", "rank")).alias("data")
)
This will create a data column containing your 3 fields, aggregated by user ID.
Then, to go from a dataframe to a dict object, you can use, for example, toJSON or the Row method asDict.
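For example, here is a minimal sketch of that last step, assuming the aggregation above is assigned to a variable named df_agg (a name introduced here only for illustration) and that the result is small enough to collect to the driver. If the per-user word order matters, sort the struct list by rank first, as in the previous answer.
# Hypothetical variable name for the aggregation shown above.
df_agg = df.groupBy("User Hash ID").agg(
    F.collect_list(F.struct("Word", "sum(Total Count)", "rank")).alias("data")
)
# collect() brings the rows to the driver; fields of the outer Row and of the
# nested struct Rows can both be accessed by name.
user_words = {
    row["User Hash ID"]: [x["Word"] for x in row["data"]]
    for row in df_agg.collect()
}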

(pyspark) How to Divide Time Intervals into Time Periods

I have a dataframe created with Spark SQL, with IDs and their corresponding checkin_datetime and checkout_datetime, as shown below.
I would like to divide each of these time intervals into one-hour periods.
Code to create the Spark dataframe:
import pandas as pd
data = {'ID': [4, 4, 4, 4, 22, 22, 25, 29],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']}
df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
df1 = spark.createDataFrame(df)
To compute the hourly intervals:
First, explode the hourly intervals between checkin_datetime and checkout_datetime. We do this by computing the number of hours between checkin_datetime and checkout_datetime and iterating over that range to generate the intervals.
Once we have exploded the intervals to find next_hour, we can use it to identify the gap between checkin_datetime and next_hour, or between checkout_datetime and next_hour.
from pyspark.sql import functions as F
import pandas as pd
data = {'ID': [4, 4, 4, 4, 22, 22, 25, 29],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']}
df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
df1 = (spark.createDataFrame(df)
       .withColumn("checkin_datetime", F.to_timestamp("checkin_datetime", "MM-dd-yyyy HH:mm"))
       .withColumn("checkout_datetime", F.to_timestamp("checkout_datetime", "MM-dd-yyyy HH:mm")))
unix_checkin = F.unix_timestamp("checkin_datetime")
unix_checkout = F.unix_timestamp("checkout_datetime")
start_hour_checkin = F.date_trunc("hour", "checkin_datetime")
unix_start_hour_checkin = F.unix_timestamp(start_hour_checkin)
checkout_next_hour = F.date_trunc("hour", "checkout_datetime") + F.expr("INTERVAL 1 HOUR")
diff_hours = F.floor((unix_checkout - unix_start_hour_checkin) / 3600)
next_hour = F.explode(
    F.transform(F.sequence(F.lit(0), diff_hours),
                lambda x: F.to_timestamp(F.unix_timestamp(start_hour_checkin) + (x + 1) * 3600)))

minute = (F.when(start_hour_checkin == F.date_trunc("hour", "checkout_datetime"),
                 (unix_checkout - unix_checkin) / 60)
          .when(checkout_next_hour == F.col("next_hour"),
                (unix_checkout - F.unix_timestamp(F.date_trunc("hour", "checkout_datetime"))) / 60)
          .otherwise(F.least((F.unix_timestamp(F.col("next_hour")) - unix_checkin) / 60, F.lit(60)))
          ).cast("int")
(df1.withColumn("next_hour", next_hour)
.withColumn("minutes", minute)
.withColumn("hr", F.date_format(F.expr("next_hour - INTERVAL 1 HOUR"), "H"))
.withColumn("day", F.to_date(F.expr("next_hour - INTERVAL 1 HOUR")))
.select("ID", "checkin_datetime", "checkout_datetime", "day", "hr", "minutes")
).show()
"""
+---+-------------------+-------------------+----------+---+-------+
| ID| checkin_datetime| checkout_datetime| day| hr|minutes|
+---+-------------------+-------------------+----------+---+-------+
| 4|2019-04-01 13:07:00|2019-04-01 13:09:00|2019-04-01| 13| 2|
| 4|2019-04-01 13:09:00|2019-04-01 13:12:00|2019-04-01| 13| 3|
| 4|2019-04-01 14:06:00|2019-04-01 14:07:00|2019-04-01| 14| 1|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 14| 5|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 15| 6|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 20| 37|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 21| 32|
| 22|2019-04-01 21:38:00|2019-04-01 21:42:00|2019-04-01| 21| 4|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-01| 23| 38|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-02| 0| 23|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 1| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 2| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 3| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 4| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 5| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 6| 15|
+---+-------------------+-------------------+----------+---+-------+
"""

Fill empty cells with duplicates in a DataFrame

I have a table similar to the following:
+----------+----+--------------+-------------+
| Date|Hour| Weather|Precipitation|
+----------+----+--------------+-------------+
|2013-07-01| 0| null| null|
|2013-07-01| 3| null| null|
|2013-07-01| 6| clear|trace of p...|
|2013-07-01| 9| null| null|
|2013-07-01| 12| null| null|
|2013-07-01| 15| null| null|
|2013-07-01| 18| rain| null|
|2013-07-01| 21| null| null|
|2013-07-02| 0| null| null|
|2013-07-02| 3| null| null|
|2013-07-02| 6| rain|low precip...|
|2013-07-02| 9| null| null|
|2013-07-02| 12| null| null|
|2013-07-02| 15| null| null|
|2013-07-02| 18| null| null|
|2013-07-02| 21| null| null|
+----------+----+--------------+-------------+
The idea is to fill the Weather and Precipitation columns with the values recorded at 6 and 18 hours, and at 6 hours, respectively. Since this table illustrates a DataFrame structure, simple iteration through it seems irrational.
I tried something like this:
//_weather stays for the table mentioned
def fillEmptyCells: Unit = {
  val hourIndex = _weather.schema.fieldIndex("Hour")
  val dateIndex = _weather.schema.fieldIndex("Date")
  val weatherIndex = _weather.schema.fieldIndex("Weather")
  val precipitationIndex = _weather.schema.fieldIndex("Precipitation")

  val days = _weather.select("Date").distinct().rdd
  days.foreach(x => {
    val day = _weather.where(s"Date == '${x(0)}'")
    val dayValues = day.where("Hour == 6").first()
    val weather = dayValues.getString(weatherIndex)
    val precipitation = dayValues.getString(precipitationIndex)
    day.rdd.map(y => (y(0), y(1), weather, precipitation))
  })
}
However, this ugly piece of code seems to smell, because it iterates through an RDD instead of handling it in a distributed manner. It also has to form a new RDD or DataFrame from pieces, which can be problematic (I have no idea how to do this). Is there a more elegant and simpler way to solve this task?
Assuming that you can easily create a timestamp column by combining Date and Hour, what I would do next is:
convert this timestamp (probably in milliseconds or seconds) into an hourTimestamp: .withColumn("hourTimestamp", ($"timestamp" / 3600).cast("long"))
create 3 columns corresponding to the different possible hour lags (3, 6, 9)
coalesce these 3 columns with the original one
Here is the code for Weather (do the same for Precipitation):
val window = org.apache.spark.sql.expressions.Window.orderBy("hourTimestamp")
val weatherUpdate = df
  .withColumn("WeatherLag1", lag("Weather", 3).over(window))
  .withColumn("WeatherLag2", lag("Weather", 6).over(window))
  .withColumn("WeatherLag3", lag("Weather", 9).over(window))
  .withColumn("Weather", coalesce($"Weather", $"WeatherLag1", $"WeatherLag2", $"WeatherLag3"))

Pyspark groupBy Pivot Transformation

I'm having a hard time framing the following PySpark dataframe manipulation.
Essentially, I am trying to group by category and then pivot/unmelt the subcategories and add new columns.
I've tried a number of ways, but they are very slow and do not leverage Spark's parallelism.
Here is my existing (slow, verbose) code:
from pyspark.sql.functions import lit
df = sqlContext.table('Table')
#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
    #loop over subcategory
    listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
    dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
    num = 1
    for b in dfArraySub:
        #renames all columns to append a number
        for c in b.columns:
            if c not in ['category','sub_category','date']:
                column_name = str(c)+'_'+str(num)
                b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
        b = b.drop('sub_category')
        num += 1
        #if no df exists, create one and continually join new columns
        try:
            all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry','date'], how='left')
        except:
            all_subs = b
    #Fixes missing columns on union
    try:
        try:
            diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
            for d in diff_columns:
                all_subs = all_subs.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
        except:
            diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
            for d in diff_columns:
                all_cats = all_cats.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
    except Exception as e:
        print e
        all_cats = all_subs
But this is very slow. Any guidance would be greatly appreciated!
Your output is not really logical, but we can achieve this result using the pivot function. You need to make your rules more precise, otherwise I can see a lot of cases where it may fail.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.show()
+----------+---------+------------+------------+------------+
| date| category|sub_category|metric_sales|metric_trans|
+----------+---------+------------+------------+------------+
|2018-01-01|furniture| bed| 100| 75|
|2018-01-01|furniture| chair| 110| 85|
|2018-01-01|furniture| shelf| 35| 30|
|2018-02-01|furniture| bed| 55| 50|
|2018-02-01|furniture| chair| 45| 40|
|2018-02-01|furniture| shelf| 10| 15|
|2018-01-01| rug| circle| 2| 5|
|2018-01-01| rug| square| 3| 6|
|2018-02-01| rug| circle| 3| 3|
|2018-02-01| rug| square| 4| 5|
+----------+---------+------------+------------+------------+
df.withColumn("fg", F.row_number().over(Window().partitionBy('date', 'category').orderBy("sub_category"))).groupBy('date', 'category', ).pivot('fg').sum('metric_sales', 'metric_trans').show()
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|2018-02-01| rug| 3| 3| 4| 5| null| null|
|2018-02-01|furniture| 55| 50| 45| 40| 10| 15|
|2018-01-01|furniture| 100| 75| 110| 85| 35| 30|
|2018-01-01| rug| 2| 5| 3| 6| null| null|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
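If the auto-generated column names like 1_sum(CAST(`metric_sales` AS BIGINT)) are unwieldy, a hedged variation is to alias each aggregation inside agg, which yields names like 1_sales and 1_trans instead:
(df.withColumn("fg", F.row_number().over(Window.partitionBy("date", "category").orderBy("sub_category")))
   .groupBy("date", "category")
   .pivot("fg")
   .agg(F.sum("metric_sales").alias("sales"), F.sum("metric_trans").alias("trans"))
   .show())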

Iterate a spark dataframe with static list of values using withcolumn [duplicate]

This question already has answers here:
How do I add a new column to a Spark DataFrame (using PySpark)?
(10 answers)
Closed 5 years ago.
I am a bit new to PySpark. I have a Spark dataframe with about 5 columns and 5 records, and I have a list of 5 records.
Now I want to add these 5 static records from the list to the existing dataframe using withColumn. I did that, but it's not working.
Any suggestions are greatly appreciated.
Below is my sample:
dq_results = []
for a in range(0, len(dq_results)):
    dataFile_df = dataFile_df.withColumn("dq_results", lit(dq_results[a]))
    print lit(dq_results[a])
thanks,
Sreeram
Create one dataframe from the list dq_results:
df_list = spark.createDataFrame(dq_results_list, schema=dq_results_col)
Add an id column to df_list (it will be the row id):
df_list_id = df_list.withColumn("id", monotonically_increasing_id())
Add an id column to dataFile_df (it will be the row id):
dataFile_df = dataFile_df.withColumn("id", monotonically_increasing_id())
Now we can join the two dataframes, df_list_id and dataFile_df:
dataFile_df.join(df_list_id, "id").show()
The joined result is the final dataframe.
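A consolidated, hedged sketch of those steps (the list values and column name are illustrative; it also swaps the raw monotonically_increasing_id values for row_number over a window, so both sides get consecutive 1..N ids, since the raw ids are not guaranteed to match across two dataframes):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Illustrative list of 5 static values to attach, one per existing row.
dq_results = ["pass", "fail", "pass", "pass", "fail"]
df_list = spark.createDataFrame([(v,) for v in dq_results], ["dq_results"])

# row_number over an arbitrary ordering gives each frame consecutive ids 1..N.
w = Window.orderBy(F.monotonically_increasing_id())
df_list_id = df_list.withColumn("id", F.row_number().over(w))
dataFile_df_id = dataFile_df.withColumn("id", F.row_number().over(w))

dataFile_df_id.join(df_list_id, "id").drop("id").show()
Note that the row-to-value pairing still depends on the dataframes' internal order, which Spark does not guarantee.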
withColumn will add a new Column, but I guess you might want to append Rows instead. Try this:
df1 = spark.createDataFrame([(a, a*2, a+3, a+4, a+5) for a in range(5)], "A B C D E".split(' '))
new_data = [[100 + i*j for i in range(5)] for j in range(5)]
df1.unionAll(spark.createDataFrame(new_data)).show()
+---+---+---+---+---+
| A| B| C| D| E|
+---+---+---+---+---+
| 0| 0| 3| 4| 5|
| 1| 2| 4| 5| 6|
| 2| 4| 5| 6| 7|
| 3| 6| 6| 7| 8|
| 4| 8| 7| 8| 9|
|100|100|100|100|100|
|100|101|102|103|104|
|100|102|104|106|108|
|100|103|106|109|112|
|100|104|108|112|116|
+---+---+---+---+---+
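If the appended rows should be matched by column name rather than by position, a hedged alternative (Spark 2.3+) is to give the new DataFrame the same column names and use unionByName:
# Reuse the illustrative new_data from above, naming the columns explicitly.
new_rows = spark.createDataFrame(new_data, "A B C D E".split(' '))
df1.unionByName(new_rows).show()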