Fill empty cells with duplicates in a DataFrame - scala

I have a table similar to following:
+----------+----+--------------+-------------+
| Date|Hour| Weather|Precipitation|
+----------+----+--------------+-------------+
|2013-07-01| 0| null| null|
|2013-07-01| 3| null| null|
|2013-07-01| 6| clear|trace of p...|
|2013-07-01| 9| null| null|
|2013-07-01| 12| null| null|
|2013-07-01| 15| null| null|
|2013-07-01| 18| rain| null|
|2013-07-01| 21| null| null|
|2013-07-02| 0| null| null|
|2013-07-02| 3| null| null|
|2013-07-02| 6| rain|low precip...|
|2013-07-02| 9| null| null|
|2013-07-02| 12| null| null|
|2013-07-02| 15| null| null|
|2013-07-02| 18| null| null|
|2013-07-02| 21| null| null|
+----------+----+--------------+-------------+
The idea is to fill columns Weather and Precipitation with values at 6 and 18 hours and at 6 hours respectively. Since this table illustrates a DataFrame structure, simple iteration through it seems to be irrational.
I tried something like this:
//_weather stands for the table mentioned above
def fillEmptyCells: Unit = {
val hourIndex = _weather.schema.fieldIndex("Hour")
val dateIndex = _weather.schema.fieldIndex("Date")
val weatherIndex = _weather.schema.fieldIndex("Weather")
val precipitationIndex = _weather.schema.fieldIndex("Precipitation")
val days = _weather.select("Date").distinct().rdd
days.foreach(x => {
val day = _weather.where("Date == $x(0)")
val dayValues = day.where("Hour == 6").first()
val weather = dayValues.getString(weatherIndex)
val precipitation = dayValues.getString(precipitationIndex)
day.rdd.map(y => (_(0), _(1), weather, precipitation))
})
}
However, this ugly piece of code seems to smell because it iterates through an RDD instead of handling it in a distributed manner. It also has to form a new RDD or DataFrame from the pieces, which can be problematic (I have no idea how to do this). Is there a more elegant and simple way to solve this task?

Assuming that you can easily create a timestamp column by combining Date and Hour, what I would do next is:
convert this timestamp (probably in milliseconds or seconds) into an hourTimestamp, for example .withColumn("hourTimestamp", floor($"timestamp" / 3600)) if it is in seconds
create 3 columns corresponding to the different possible hour lags (3, 6, 9)
coalesce these 3 columns + the original one
Here is the code for Weather (do the same for Precipitation):
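import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag}
// lag offsets count rows, not hours: with one row every 3 hours,
// offsets 1, 2, 3 reach back 3, 6 and 9 hours ($ needs spark.implicits._)
val window = Window.orderBy("hourTimestamp")
val weatherUpdate = df
  .withColumn("WeatherLag1", lag("Weather", 1).over(window))
  .withColumn("WeatherLag2", lag("Weather", 2).over(window))
  .withColumn("WeatherLag3", lag("Weather", 3).over(window))
  .withColumn("Weather", coalesce($"Weather", $"WeatherLag1", $"WeatherLag2", $"WeatherLag3"))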
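The lag-and-coalesce idea can be sketched in plain Python to see which cells it actually fills. This is only an illustration of the logic, not Spark code: the helper names coalesce and fill_from_lags are made up for the example, and the offsets are counted in rows (rows 3 hours apart, so row offsets 1, 2, 3 stand for hour lags 3, 6, 9).

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def fill_from_lags(values, offsets=(1, 2, 3)):
    """Fill each None with the nearest earlier original value within the offsets."""
    def lag(index, offset):
        # Like a window lag: look back `offset` rows in the ORIGINAL column
        return values[index - offset] if index - offset >= 0 else None

    return [
        coalesce(value, *(lag(i, k) for k in offsets))
        for i, value in enumerate(values)
    ]


# Weather column for 2013-07-01, hours 0, 3, ..., 21
weather = [None, None, "clear", None, None, None, "rain", None]
filled = fill_from_lags(weather)
print(filled)
# [None, None, 'clear', 'clear', 'clear', 'clear', 'rain', 'rain']
```

Note that the rows before the first observation stay empty, because lag only looks backward; filling hours 0 and 3 from the hour-6 value would need a lead in the other direction as well.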

Scala Spark functions like group by, describe() returning incorrect result

I am using Scala Spark in the IntelliJ IDE to analyze a CSV file with 672,112 records. The file is available at https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding
File name: kiva_loans.csv
I ran show() to view a few records and it reads all columns correctly, but when I group by the column "repayment_interval", it displays values that appear to be data from other columns (a column shift), as shown below.
The distinct values in the "repayment_interval" column are:
Monthly (More frequent)
irregular
bullet
weekly (less frequent)
For testing purposes, I searched for the values shown in the screenshot, put those rows in a separate file and tried to read that file using Scala Spark. It shows all values in the correct columns, and even groupBy returns correct values.
I am also facing this issue with the describe() function.
As shown in the image above, the columns "id" and "funded_amount" are numeric, but I am not sure why describe() gives string values for their "min" and "max".
The read CSV command is as below:
val kivaloans=spark.read
//.option("sep",",")
.format("com.databricks.spark.csv")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
This is the printSchema output after adding .option("multiline","true"). It reads a few rows as the header, as highlighted in yellow.
It seems there are newline characters in the column data. Hence, set the multiline option to true.
val kivaloans=spark.read.format("com.databricks.spark.csv")
.option("multiline","true")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
Data summary is as follows after setting multiline as true:
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
|summary| id| funded_amount| loan_amount| activity| sector| use| country_code| country| region| currency| partner_id| posted_time| disbursed_time| funded_time| term_in_months| lender_count| tags| borrower_genders| repayment_interval| date|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
| count| 671205| 671205| 671205| 671205| 671205| 666977| 671197| 671205| 614441| 671186| 657699| 671195| 668808| 622890| 671196| 671199| 499834| 666957| 671191| 671199|
| mean| 993248.5937336581|785.9950611214159|842.3971066961659| null| null| 10000.0| null| null| null| null| 178.20274555550654| 162.01020408163265| 179.12244897959184| 189.3|13.74266332047713|20.588457578299735| 25.68553459119497| 26.4| 26.210526315789473| 27.304347826086957|
| stddev|196611.27542282813|1130.398941057504|1198.660072882945| null| null| NaN| null| null| null| null| 94.24892231613454| 78.65564973356628| 100.70555939905975| 125.87299363372507|8.631922222356161|28.458485403188924| 31.131029407317044| 35.87289875191111| 52.43279244938066| 41.99181173710449|
| min| 653047| 0.0| 25.0|Adult Care|AgricultuTo buy chicken.| ""fajas"" [wove...| 10 boxes of cream| 3x1 purlins| T-shaped brackets| among other prod...| among other item...| and pay for labour"| and cassava to m...| yeast| rice| milk| among other prod...|#Animals, #Biz Du...| #Elderly|
| 25%| 823364| 250.0| 275.0| null| null| 10000.0| null| null| null| null| 126.0| 123.0| 105.0| 87.0| 8.0| 7.0| 8.0| 8.0| 9.0| 6.0|
| 50%| 992996| 450.0| 500.0| null| null| 10000.0| null| null| null| null| 145.0| 144.0| 144.0| 137.0| 13.0| 13.0| 14.0| 15.0| 14.0| 17.0|
| 75%| 1163938| 900.0| 1000.0| null| null| 10000.0| null| null| null| null| 204.0| 177.0| 239.0| 201.0| 14.0| 24.0| 27.0| 31.0| 24.0| 34.0|
| max| 1340339| 100000.0| 100000.0| Wholesale| Wholesale|? provide a safer...| ZW| Zimbabwe| ?ZM?T| baguida| XOF| XOF| Yoro, Yoro| USD| USD| USD|volunteer_pick, v...|volunteer_pick, v...| weekly|volunteer_pick, v...|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
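Why an embedded line break shifts columns can be seen with a small plain-Python sketch. It uses the standard csv module rather than Spark's reader, and the three-column sample data is invented for the illustration: splitting on line breaks cuts a quoted field in two, while a quote-aware parser keeps the record whole.

```python
import csv
import io

# Invented sample: the "use" field of record 1 contains a line break inside quotes
raw = (
    "id,use,repayment_interval\n"
    '1,"To buy seeds,\nfertilizer, tools, bags",monthly\n'
    "2,To buy a cow,bullet\n"
)

# Naive approach: treat every physical line as one record
naive_rows = [line.split(",") for line in raw.splitlines()[1:]]
naive_intervals = [row[2] for row in naive_rows]

# Quote-aware approach: the line break stays inside the quoted field
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))[1:]
parsed_intervals = [row[2] for row in parsed_rows]

print(len(naive_rows), naive_intervals)    # 3 ['', ' bags"', 'bullet']
print(len(parsed_rows), parsed_intervals)  # 2 ['monthly', 'bullet']
```

In the naive version a fragment of the "use" text lands in the repayment_interval position, which is the same kind of shift seen in the groupBy and describe() output.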

six months record of every account number in pyspark

I have tried the rank function, but it gives results that go beyond 180 days.
This is the result I am getting, which is wrong because it includes transactions beyond 180 days:
window = Window.partitionBy(df3['acctno']).orderBy(df3['trans_date'])
df3.select('*', rank().over(window).alias('rank')) \
.filter(col('rank') <= 180) \
.show(500)
|year|month|day|date| txnrefid| acc|branch|channel|rank|
+----+-----+---+-----------+-----------+--------------+----------+-----------+----+
|2020| 2| 6| 2020-02-06| 1234abcd6| 2074-556-1111| 6666| CBS| 1|
|2020| 2| 7| 2020-02-07| 1234abcd7| 2074-556-1111| 6666| CBS| 2|
|2020| 2| 8| 2020-02-08| 1234abcd8| 2074-556-1111| 6666| CBS| 3|
|2020| 2| 9| 2020-02-09| 1234abcd9| 2074-556-1111| 6666| CBS| 4|
But I want a result like this:
|year|month|day|date| txnrefid| acc|branch|channel|rank|
|2020| 2| 6| 2020-02-06| 1234abcd6| 2074-556-1111| 6666| CBS| 1|
|2020| 2| 7| 2020-02-07| 1234abcd7| 2074-556-1111| 6666| CBS| 2|
|2020| 2| 8| 2020-02-08| 1234abcd8| 2074-556-1111| 6666| CBS| 3|
|2020| 2| 9| 2020-02-09| 1234abcd9| 2074-556-1111| 6666|
As you edited your question, here is a new answer that uses a different approach.
The idea is to get the min date for each account number, compute the limit date (min date + 180 days), then remove all the rows dated after that limit.
from pyspark.sql import functions as F
from pyspark.sql import Window
df.count() # I used your sample data, so 60 lines
> 60
w = Window.partitionBy(df["acctno"])
# Earliest transaction date of each account
df = df.withColumn("min_date", F.min(F.col("trans_date").cast("date")).over(w))
# Keep only the rows within 180 days of that first date
df = df.where(
    F.col("trans_date").cast("date")
    <= F.date_add(  # Use F.date_add to add days or F.add_months to add months.
        F.col("min_date"), 180
    )
).drop("min_date")
df.count() # Final dataframe limited to 180 days, nothing later than 2020-08-04
> 54
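The cutoff logic can be checked in plain Python with the standard datetime module. This is a sketch of the same rule outside Spark; the function name first_n_days and the sample rows are made up for the illustration.

```python
from datetime import date, timedelta


def first_n_days(rows, days=180):
    """Keep (account, date) rows within `days` of each account's first date."""
    first_seen = {}
    for account, day in rows:
        first_seen[account] = min(day, first_seen.get(account, day))
    return [
        (account, day)
        for account, day in rows
        if day <= first_seen[account] + timedelta(days=days)
    ]


rows = [
    ("2074-556-1111", date(2020, 2, 6)),
    ("2074-556-1111", date(2020, 8, 4)),   # exactly 180 days later: kept
    ("2074-556-1111", date(2020, 8, 5)),   # 181 days later: dropped
    ("9999-000-0000", date(2020, 8, 5)),   # first date of another account: kept
]
kept = first_n_days(rows)
print(date(2020, 2, 6) + timedelta(days=180))  # 2020-08-04
print(len(kept))                               # 3
```

The limit is computed per account, so a date that is too late for one account can still be the first date of another.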
If you want the first 6 months, then you should use the fields "year" and "month", not "trans_date".
Something like window = Window.partitionBy(df3['acctno']).orderBy(df3['year'], df3['month']) should give you better results.
Then you filter on rank <= 6:
df3.select("*", dense_rank().over(window).alias("rank")).filter(col("rank") <= 6).show(500)
EDIT: You need to use dense_rank, not rank.
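The difference between rank and dense_rank matters here because many transactions share the same (year, month). A plain-Python sketch of the two numbering rules (the helper functions are written for this illustration and only mimic the SQL semantics):

```python
def rank(keys):
    """SQL-style rank: 1 + number of rows with a strictly smaller key (gaps after ties)."""
    return [1 + sum(1 for other in keys if other < key) for key in keys]


def dense_rank(keys):
    """SQL-style dense_rank: position of the key among the distinct keys (no gaps)."""
    position = {key: i for i, key in enumerate(sorted(set(keys)), start=1)}
    return [position[key] for key in keys]


# (year, month) of six transactions of one account
months = [(2020, 2), (2020, 2), (2020, 2), (2020, 3), (2020, 3), (2020, 4)]
print(rank(months))        # [1, 1, 1, 4, 4, 6]
print(dense_rank(months))  # [1, 1, 1, 2, 2, 3]
```

With rank, the third distinct month already gets the number 6, so filtering on rank <= 6 would not select six months; with dense_rank each distinct month gets the next integer.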

GroupBy with condition on aggregate Spark/Scala

I have a dataframe like this :
| ID_VISITE_CALCULE| TAG_TS_TO_TS|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| RK|
+--------------------+-------------------+------------------+------------------------+---+
|GA1.2.1023040287....|2019-04-23 11:24:19| dupont| null| 1|
|GA1.2.1023040287....|2019-04-23 11:24:19| durand| null| 2|
|GA1.2.105243141.1...|2019-04-23 11:21:01| null| null| 1|
|GA1.2.1061963529....|2019-04-23 11:12:19| null| null| 1|
|GA1.2.1065635192....|2019-04-23 11:07:14| antoni| null| 1|
|GA1.2.1074357108....|2019-04-23 11:11:34| lang| null| 1|
|GA1.2.1074357108....|2019-04-23 11:12:37| lang| null| 2|
|GA1.2.1075803022....|2019-04-23 11:28:38| cavail| null| 1|
|GA1.2.1080137035....|2019-04-23 11:20:00| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| linare| null| 2|
|GA1.2.1111218536....|2019-04-23 11:28:43| null| null| 1|
|GA1.2.1111218536....|2019-04-23 11:32:26| null| null| 2|
|GA1.2.1111570355....|2019-04-23 11:07:00| null| null| 1|
+--------------------+-------------------+------------------+------------------------+---+
I'm trying to apply rules to aggregate by ID_VISITE_CALCULE and keep only one row for an ID.
For an ID (a group), I want to:
get the first timestamp of the group and store it in a START column
get the last timestamp of the group and store it in an END column
test if EXTERNAL_PERSON_ID is the same for the whole group.
If it is the same and it is NULL, then I write NULL; if it is the same and it is a name, then I write the name. Finally, if there are different values in the group, then I write UNDEFINED
apply exactly the same rules to the column EXTERNAL_ORGANISATION_ID
RESULT :
+--------------------+------------------+------------------------+-------------------+-------------------+
| ID_VISITE_CALCULE|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| START| END|
+--------------------+------------------+------------------------+-------------------+-------------------+
|GA1.2.1023040287....| undefined| null|2019-04-23 11:24:19|2019-04-23 11:24:19|
|GA1.2.105243141.1...| null| null|2019-04-23 11:21:01|2019-04-23 11:21:01|
|GA1.2.1061963529....| null| null|2019-04-23 11:12:19|2019-04-23 11:12:19|
|GA1.2.1065635192....| antoni| null|2019-04-23 11:07:14|2019-04-23 11:07:14|
|GA1.2.1074357108....| lang| null|2019-04-23 11:11:34|2019-04-23 11:12:37|
|GA1.2.1075803022....| cavail| null|2019-04-23 11:28:38|2019-04-23 11:28:38|
|GA1.2.1080137035....| null| null|2019-04-23 11:20:00|2019-04-23 11:20:00|
|GA1.2.1081805479....| undefined| null|2019-04-23 11:10:49|2019-04-23 11:10:49|
|GA1.2.1111218536....| null| null|2019-04-23 11:28:43|2019-04-23 11:32:26|
|GA1.2.1111570355....| null| null|2019-04-23 11:07:00|2019-04-23 11:07:00|
+--------------------+------------------+------------------------+-------------------+-------------------+
In my example, I only have 2 lines for a group at most, but in the real dataset I can have several hundred lines in a group.
Thank you for your kind assistance.
All of this can be done in a single groupBy call; however, for the (slight) performance benefit and for readability of the code, I'd suggest splitting it into 2 steps:
import org.apache.spark.sql.functions.{col, size, collect_set, max, min, when, lit}
// Step 1: one row per visit, with the time bounds and the distinct non-null IDs
val res1DF = df.groupBy(col("ID_VISITE_CALCULE")).agg(
  min(col("TAG_TS_TO_TS")).alias("START"),
  max(col("TAG_TS_TO_TS")).alias("END"),
  collect_set(col("EXTERNAL_PERSON_ID")).alias("EXTERNAL_PERSON_ID"),
  collect_set(col("EXTERNAL_ORGANISATION_ID")).alias("EXTERNAL_ORGANISATION_ID")
)
// Step 2: several distinct values become UNDEFINED, otherwise keep the single value (or null)
val res2DF = res1DF.withColumn("EXTERNAL_PERSON_ID",
  when(size(col("EXTERNAL_PERSON_ID")) > 1, lit("UNDEFINED"))
    .otherwise(col("EXTERNAL_PERSON_ID").getItem(0))
).withColumn("EXTERNAL_ORGANISATION_ID",
  when(size(col("EXTERNAL_ORGANISATION_ID")) > 1, lit("UNDEFINED"))
    .otherwise(col("EXTERNAL_ORGANISATION_ID").getItem(0))
)
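The method getItem handles the remaining cases in the background: if the set of values is empty, it returns null, and if there is just a single value, it returns that value. Note that collect_set ignores nulls, so a group that mixes null with a single name yields that name rather than UNDEFINED.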
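The rule applied to each group can be modelled in plain Python to see how null values are treated. This is a sketch of the logic only: summarize mirrors the collect_set approach (nulls dropped before counting), and summarize_strict is a hypothetical variant that treats null as a value of its own, which is what the expected result in the question seems to require.

```python
def summarize(values):
    """Mimic collect_set + size: None is dropped before counting distinct values."""
    distinct = {value for value in values if value is not None}
    if len(distinct) > 1:
        return "UNDEFINED"
    return next(iter(distinct), None)  # the single value, or None for an empty set


def summarize_strict(values):
    """Variant (an assumption, not the answer's code): None counts as a distinct value."""
    distinct = set(values)
    if len(distinct) > 1:
        return "UNDEFINED"
    return next(iter(distinct), None)


print(summarize(["dupont", "durand"]))      # UNDEFINED
print(summarize(["lang", "lang"]))          # lang
print(summarize([None, None]))              # None
print(summarize([None, "linare"]))          # linare
print(summarize_strict([None, "linare"]))   # UNDEFINED
```

The last two lines show the only case where the two rules disagree: a group with one null row and one named row.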
It would be good if you showed some code or sample data from which the dataframe is built.
Assuming you have a dataframe called tableDf
Spark SQL solution:
tableDf.createOrReplaceTempView("input_table")
val sqlStr ="""
select ID_VISITE_CALCULE,
(case when count(distinct person_id_calculation) > 1 then "undefined"
when count(distinct person_id_calculation) = 1 and
max(person_id_calculation) = "noNull" then ""
else max(person_id_calculation) end) as EXTERNAL_PERSON_ID,
-- do the same for EXTERNAL_ORGANISATION_ID
max(start_v) as start_v, max(last_v) as last_v
from
(select ID_VISITE_CALCULE,
( case
when nvl(EXTERNAL_PERSON_ID,"noNull") =
lag(EXTERNAL_PERSON_ID,1,"noNull")over(partition by
ID_VISITE_CALCULE order by TAG_TS_TO_TS) then
EXTERNAL_PERSON_ID
else "undefined" end ) AS person_id_calculation,
-- Same calculation for EXTERNAL_ORGANISATION_ID
first(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
TAG_TS_TO_TS) as START_V,
last(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
TAG_TS_TO_TS) as last_V
from input_table ) a
group by 1
"""
val resultDf = spark.sql(sqlStr)

Pyspark groupBy Pivot Transformation

I'm having a hard time framing the following Pyspark dataframe manipulation.
Essentially I am trying to group by category and then pivot/unmelt the subcategories and add new columns.
I've tried a number of ways, but they are very slow and do not leverage Spark's parallelism.
Here is my existing (slow, verbose) code:
from pyspark.sql.functions import lit
df = sqlContext.table('Table')
#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
    #loop over subcategory
    listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
    dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
    num = 1
    for b in dfArraySub:
        #renames all columns to append a number
        for c in b.columns:
            if c not in ['category','sub_category','date']:
                column_name = str(c)+'_'+str(num)
                b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
        b = b.drop('sub_category')
        num += 1
        #if no df exists, create one and continually join new columns
        try:
            all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry','date'], how='left')
        except:
            all_subs = b
    #Fixes missing columns on union
    try:
        try:
            diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
            for d in diff_columns:
                all_subs = all_subs.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
        except:
            diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
            for d in diff_columns:
                all_cats = all_cats.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
    except Exception as e:
        print e
        all_cats = all_subs
But this is very slow. Any guidance would be greatly appreciated!
Your output is not really logical, but we can achieve this result using the pivot function. You need to state your rules precisely; otherwise I can see a lot of cases where it may fail.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.show()
+----------+---------+------------+------------+------------+
| date| category|sub_category|metric_sales|metric_trans|
+----------+---------+------------+------------+------------+
|2018-01-01|furniture| bed| 100| 75|
|2018-01-01|furniture| chair| 110| 85|
|2018-01-01|furniture| shelf| 35| 30|
|2018-02-01|furniture| bed| 55| 50|
|2018-02-01|furniture| chair| 45| 40|
|2018-02-01|furniture| shelf| 10| 15|
|2018-01-01| rug| circle| 2| 5|
|2018-01-01| rug| square| 3| 6|
|2018-02-01| rug| circle| 3| 3|
|2018-02-01| rug| square| 4| 5|
+----------+---------+------------+------------+------------+
df.withColumn(
    "fg",
    F.row_number().over(Window.partitionBy("date", "category").orderBy("sub_category")),
).groupBy("date", "category").pivot("fg").sum("metric_sales", "metric_trans").show()
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|2018-02-01| rug| 3| 3| 4| 5| null| null|
|2018-02-01|furniture| 55| 50| 45| 40| 10| 15|
|2018-01-01|furniture| 100| 75| 110| 85| 35| 30|
|2018-01-01| rug| 2| 5| 3| 6| null| null|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
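What the row_number plus pivot combination does can be sketched in plain Python: within each (date, category) group the sub-categories are numbered in alphabetical order, and that number becomes part of the column name. The function name pivot_by_position and the column naming scheme metric_sales_1, metric_trans_1, ... are assumptions chosen for the illustration, not Spark's generated names.

```python
from collections import defaultdict


def pivot_by_position(rows):
    """rows: (date, category, sub_category, metric_sales, metric_trans) tuples."""
    groups = defaultdict(list)
    for day, category, sub_category, sales, trans in rows:
        groups[(day, category)].append((sub_category, sales, trans))

    wide = {}
    for key, members in groups.items():
        columns = {}
        # Number the sub-categories 1, 2, 3, ... in alphabetical order
        for position, (_, sales, trans) in enumerate(sorted(members), start=1):
            columns[f"metric_sales_{position}"] = sales
            columns[f"metric_trans_{position}"] = trans
        wide[key] = columns
    return wide


rows = [
    ("2018-01-01", "furniture", "bed", 100, 75),
    ("2018-01-01", "furniture", "chair", 110, 85),
    ("2018-01-01", "furniture", "shelf", 35, 30),
    ("2018-01-01", "rug", "square", 3, 6),
    ("2018-01-01", "rug", "circle", 2, 5),
]
result = pivot_by_position(rows)
print(result[("2018-01-01", "rug")])
# {'metric_sales_1': 2, 'metric_trans_1': 5, 'metric_sales_2': 3, 'metric_trans_2': 6}
```

The sketch also shows the weakness of the approach: position 1 means "bed" for furniture but "circle" for rug, so a column only has meaning relative to its category.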

Find and replace not working - dataframe spark scala

I have the following dataframe:
df.show
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 2|
|2017-05-20| 1|
|2017-06-23| 2|
|2017-06-16| 3|
|2017-06-30| 1|
I want to replace the count values with 0 where they are greater than 1, i.e., the resulting dataframe should be:
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 0|
|2017-05-20| 1|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| 1|
I tried the following expression:
df.withColumn("count", when(($"count" > 1), 0)).show
but the output was
+----------+--------+
| createdon| count|
+----------+--------+
|2017-06-28| null|
|2017-06-17| 0|
|2017-05-20| null|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| null|
I am not able to understand why null is displayed for the value 1, and how to overcome that. Can anyone help me?
You need to chain otherwise after when to specify the value for the rows where the condition doesn't hold; in your case, that is the count column itself:
df.withColumn("count", when(($"count" > 1), 0).otherwise($"count"))
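The behaviour of when with and without otherwise can be modelled in a few lines of plain Python. This is only a sketch of the semantics, with a made-up helper replace_if: a row that matches no condition gets null (None here) unless a fallback is given.

```python
def replace_if(values, predicate, replacement, keep_original):
    """Model of when(...): unmatched rows become None unless keep_original is set."""
    return [
        replacement if predicate(value) else (value if keep_original else None)
        for value in values
    ]


counts = [1, 2, 1, 2, 3, 1]

# when(count > 1, 0) without otherwise
print(replace_if(counts, lambda c: c > 1, 0, keep_original=False))
# [None, 0, None, 0, 0, None]

# when(count > 1, 0).otherwise(count)
print(replace_if(counts, lambda c: c > 1, 0, keep_original=True))
# [1, 0, 1, 0, 0, 1]
```

The first result matches the nulls in the question's output, and the second matches the desired dataframe.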
This can also be done using a udf:
def replaceWithZero = udf((col: Int) => if(col > 1) 0 else col) //udf function
df.withColumn("count", replaceWithZero($"count")).show(false) //calling udf function
Note: a udf should be used only when there is no built-in function for the task, because it requires serialization and deserialization of the column data.