I have a dataframe like this :
| ID_VISITE_CALCULE| TAG_TS_TO_TS|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| RK|
+--------------------+-------------------+------------------+------------------------+---+
|GA1.2.1023040287....|2019-04-23 11:24:19| dupont| null| 1|
|GA1.2.1023040287....|2019-04-23 11:24:19| durand| null| 2|
|GA1.2.105243141.1...|2019-04-23 11:21:01| null| null| 1|
|GA1.2.1061963529....|2019-04-23 11:12:19| null| null| 1|
|GA1.2.1065635192....|2019-04-23 11:07:14| antoni| null| 1|
|GA1.2.1074357108....|2019-04-23 11:11:34| lang| null| 1|
|GA1.2.1074357108....|2019-04-23 11:12:37| lang| null| 2|
|GA1.2.1075803022....|2019-04-23 11:28:38| cavail| null| 1|
|GA1.2.1080137035....|2019-04-23 11:20:00| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| linare| null| 2|
|GA1.2.1111218536....|2019-04-23 11:28:43| null| null| 1|
|GA1.2.1111218536....|2019-04-23 11:32:26| null| null| 2|
|GA1.2.1111570355....|2019-04-23 11:07:00| null| null| 1|
+--------------------+-------------------+------------------+------------------------+---+
I'm trying to apply rules to aggregate by ID_VISITE_CALCULE and keep only one row for an ID.
For an ID (a group), I wish:
get the first timestamp of the group and store it in a START column
get the last timestamp of the group and store it in an END column
test if EXTERNAL_PERSON_ID is the same for the whole group.
If this is the case and it is NULL then I write NULL, if it is and it is a name then I write the name. Finally if there are different values in the group then I register UNDEFINED
apply exactly the same rules to the column EXTERNAL_ORGANIZATION_ID
RESULT :
+--------------------+------------------+------------------------+-------------------+-------------------+
| ID_VISITE_CALCULE|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| START| END|
+--------------------+------------------+------------------------+-------------------+-------------------+
|GA1.2.1023040287....| undefined| null|2019-04-23 11:24:19|2019-04-23 11:24:19|
|GA1.2.105243141.1...| null| null|2019-04-23 11:21:01|2019-04-23 11:21:01|
|GA1.2.1061963529....| null| null|2019-04-23 11:12:19|2019-04-23 11:12:19|
|GA1.2.1065635192....| antoni| null|2019-04-23 11:07:14|2019-04-23 11:07:14|
|GA1.2.1074357108....| lang| null|2019-04-23 11:11:34|2019-04-23 11:12:37|
|GA1.2.1075803022....| cavail| null|2019-04-23 11:28:38|2019-04-23 11:28:38|
|GA1.2.1080137035....| null| null|2019-04-23 11:20:00|2019-04-23 11:20:00|
|GA1.2.1081805479....| undefined| null|2019-04-23 11:10:49|2019-04-23 11:10:49|
|GA1.2.1111218536....| null| null|2019-04-23 11:28:43|2019-04-23 11:32:26|
|GA1.2.1111570355....| null| null|2019-04-23 11:07:00|2019-04-23 11:07:00|
+--------------------+------------------+------------------------+-------------------+-------------------+
In my example, I only have 2 lines for a group at most, but in the real dataset I can have several hundred lines in a group.
Thank you for your kind assistance.
All can be done in single groupby call, however I'd suggest for the (slight) performance benefits and for readability of the code to split into 2 calls:
import org.apache.spark.sql.functions.{col, size, collect_set, max, min, when, lit}
val res1DF = df.groupBy(col("ID_VISITE_CALCULE")).agg(
min(col("START")).alias("START"),
max(col("END")).alias("END"),
collect_set(col("EXTERNAL_PERSON_ID")).alias("EXTERNAL_PERSON_ID"),
collect_set(col("EXTERNAL_ORGANIZATION_ID")).alias("EXTERNAL_ORGANIZATION_ID")
)
val res2DF = res1DF.withColumn("EXTERNAL_PERSON_ID",
when(
size(col("EXTERNAL_PERSON_ID")) > 1,
lit("UNDEFINED")).otherwise(col("EXTERNAL_PERSON_ID").getItem(0)
)
).withColumn("EXTERNAL_ORGANIZATION_ID",
when(
size(col("EXTERNAL_ORGANIZATION_ID")) > 1,
lit("UNDEFINED")).otherwise(col("EXTERNAL_ORGANIZATION_ID").getItem(0)
)
)
The method getItem does most of the conditions in the background. If the set of values is empty, it will return null and if there is just 1 single value, it will return the value.
/It would be good if you show some code/ Sample Data from where dataframe is built.
Assuming you have a dataframe as tableDf
** Spark Sql Solution **
tableDf.createOrReplaceTempView("input_table")
val sqlStr ="""
select ID_VISITE_CALCULE,
(case when count(distinct person_id_calculation) > 1 then "undefined"
when count(distinct person_id_calculation) = 1 and
max(person_id_calculation) = "noNull" then ""
else max(person_id_calculation)) as EXTERNAL_PERSON_ID,
-- do the same for EXTERNAL_ORGANISATION_ID
max(start_v) as start_v, max(last_v) as last_v
from
(select ID_VISITE_CALCULE,
( case
when nvl(EXTERNAL_PERSON_ID,"noNull") =
lag(EXTERNAL_PERSON_ID,1,"noNull")over(partition by
ID_VISITE_CALCULE order by TAG_TS_TO_TS) then
EXTERNAL_PERSON_ID
else "undefined" end ) AS person_id_calculation,
-- Same calculation for EXTERNAL_ORGANISATION_ID
first(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
TAG_TS_TO_TS) as START_V,
last(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
TAG_TS_TO_TS) as last_V
from input_table ) a
group by 1
"""
val resultDf = spark.sql(sqlStr)
Related
I have using Scala Spark on intellij IDE to analyze a csv file having 672,112 records . File is available on the link - https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding
File name : kiva_loans.csv
I ran show() command to view few records and it is reading all columns correctly but when I apply group by on the column "repayment_interval", it displays value which appears to be data from other columns (column shift ) as shown below.
distinct values in the "repayment_interval" columns are
Monthly (More frequent)
irregular
bullet
weekly (less frequent)
For testing purpose, I searched for values given in the screenshot and put those rows in a separate file and tried to read that file using scala spark. It is showing all values in correct column and even groupby is returning correct values.
I am facing this issue with describe() function.
As shown in above image , column - id & "funded_amount" is numeric columns but not sure why describe() on them is giving string values for "min","max".
read csv command as below
val kivaloans=spark.read
//.option("sep",",")
.format("com.databricks.spark.csv")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
printSchema output after adding ".option("multiline","true")". It is reading few rows as header as shown in the highlighted yellow color.
It seems, there are new line characters in columns data. Hence, set property multiline as true.
val kivaloans=spark.read.format("com.databricks.spark.csv")
.option("multiline","true")
.option("header",true)
.option("inferschema","true")
.csv("kiva_loans.csv")
Data summary is as follows after setting multiline as true:
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
|summary| id| funded_amount| loan_amount| activity| sector| use| country_code| country| region| currency| partner_id| posted_time| disbursed_time| funded_time| term_in_months| lender_count| tags| borrower_genders| repayment_interval| date|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
| count| 671205| 671205| 671205| 671205| 671205| 666977| 671197| 671205| 614441| 671186| 657699| 671195| 668808| 622890| 671196| 671199| 499834| 666957| 671191| 671199|
| mean| 993248.5937336581|785.9950611214159|842.3971066961659| null| null| 10000.0| null| null| null| null| 178.20274555550654| 162.01020408163265| 179.12244897959184| 189.3|13.74266332047713|20.588457578299735| 25.68553459119497| 26.4| 26.210526315789473| 27.304347826086957|
| stddev|196611.27542282813|1130.398941057504|1198.660072882945| null| null| NaN| null| null| null| null| 94.24892231613454| 78.65564973356628| 100.70555939905975| 125.87299363372507|8.631922222356161|28.458485403188924| 31.131029407317044| 35.87289875191111| 52.43279244938066| 41.99181173710449|
| min| 653047| 0.0| 25.0|Adult Care|AgricultuTo buy chicken.| ""fajas"" [wove...| 10 boxes of cream| 3x1 purlins| T-shaped brackets| among other prod...| among other item...| and pay for labour"| and cassava to m...| yeast| rice| milk| among other prod...|#Animals, #Biz Du...| #Elderly|
| 25%| 823364| 250.0| 275.0| null| null| 10000.0| null| null| null| null| 126.0| 123.0| 105.0| 87.0| 8.0| 7.0| 8.0| 8.0| 9.0| 6.0|
| 50%| 992996| 450.0| 500.0| null| null| 10000.0| null| null| null| null| 145.0| 144.0| 144.0| 137.0| 13.0| 13.0| 14.0| 15.0| 14.0| 17.0|
| 75%| 1163938| 900.0| 1000.0| null| null| 10000.0| null| null| null| null| 204.0| 177.0| 239.0| 201.0| 14.0| 24.0| 27.0| 31.0| 24.0| 34.0|
| max| 1340339| 100000.0| 100000.0| Wholesale| Wholesale|? provide a safer...| ZW| Zimbabwe| ?ZM?T| baguida| XOF| XOF| Yoro, Yoro| USD| USD| USD|volunteer_pick, v...|volunteer_pick, v...| weekly|volunteer_pick, v...|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
I am using spark with Scala to transform a Dataframe , where I would like to compute a new variable which calculates the rank of one variable per row within many variables.
Example -
Input DF-
+---+---+---+
|c_0|c_1|c_2|
+---+---+---+
| 11| 11| 35|
| 22| 12| 66|
| 44| 22| 12|
+---+---+---+
Expected DF-
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 11| 35| 2| 3| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
This has aleady been answered using R - Rank per row over multiple columns in R,
but I need to do the same in spark-sql using scala. Thanks for the Help!
Edit- 4/1 . Encountered one scenario where if the values are same the ranks should be different. Editing first row for replicating the situation.
If I understand correctly, you want to have the rank of each column, within each row.
Let's first define the data, and the columns to "rank".
val df = Seq((11, 21, 35),(22, 12, 66),(44, 22 , 12))
.toDF("c_0", "c_1", "c_2")
val cols = df.columns
Then we define a UDF that finds the index of an element in an array.
val pos = udf((a : Seq[Int], elt : Int) => a.indexOf(elt)+1)
We finally create a sorted array (in descending order) and use the UDF to find the rank of each column.
val ranks = cols.map(c => pos(col("array"), col(c)).as(c+"_rank"))
df.withColumn("array", sort_array(array(cols.map(col) : _*), false))
.select((cols.map(col)++ranks) :_*).show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 12| 35| 3| 2| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
EDIT:
As of Spark 2.4, the pos UDF that I defined can be replaced by the built in function array_position(column: Column, value: Any) that works exactly the same way (the first index is 1). This avoids using UDFs that can be slightly less efficient.
EDIT2:
The code above will generate duplicated indices in case you have duplidated keys. If you want to avoid it, you can create the array, zip it to remember which column is which, sort it and zip it again to get the final rank. It would look like this:
val colMap = df.columns.zipWithIndex.map(_.swap).toMap
val zip = udf((s: Seq[Int]) => s
.zipWithIndex
.sortBy(-_._1)
.map(_._2)
.zipWithIndex
.toMap
.mapValues(_+1))
val ranks = (0 until cols.size)
.map(i => 'zip.getItem(i) as colMap(i) + "_rank")
val result = df
.withColumn("zip", zip(array(cols.map(col) : _*)))
.select(cols.map(col) ++ ranks :_*)
One way to go about this would be to use windows.
val df = Seq((11, 21, 35),(22, 12, 66),(44, 22 , 12))
.toDF("c_0", "c_1", "c_2")
(0 to 2)
.map("c_"+_)
.foldLeft(df)((d, column) =>
d.withColumn(column+"_rank", rank() over Window.orderBy(desc(column))))
.show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 22| 12| 66| 2| 3| 1|
| 11| 21| 35| 3| 2| 2|
| 44| 22| 12| 1| 1| 3|
+---+---+---+--------+--------+--------+
But this is not a good idea. All the data will end up in one partition which will cause an OOM error if all the data does not fit inside one executor.
Another way would require to sort the dataframe three times, but at least that would scale to any size of data.
Let's define a function that zips a dataframe with consecutive indices (it exists for RDDs but not for dataframes)
def zipWithIndex(df : DataFrame, name : String) : DataFrame = {
val rdd = df.rdd.zipWithIndex
.map{ case (row, i) => Row.fromSeq(row.toSeq :+ (i+1)) }
val newSchema = df.schema.add(StructField(name, LongType, false))
df.sparkSession.createDataFrame(rdd, newSchema)
}
And let's use it on the same dataframe df:
(0 to 2)
.map("c_"+_)
.foldLeft(df)((d, column) =>
zipWithIndex(d.orderBy(desc(column)), column+"_rank"))
.show
which provides the exact same result as above.
You could probably create a window function. Do note that this is susceptible to OOM if you have too much data. But, I just wanted to introduce to the concept of window functions here.
inputDF.createOrReplaceTempView("my_df")
val expectedDF = spark.sql("""
select
c_0
, c_1
, c_2
, rank(c_0) over (order by c_0 desc) c_0_rank
, rank(c_1) over (order by c_1 desc) c_1_rank
, rank(c_2) over (order by c_2 desc) c_2_rank
from my_df""")
expectedDF.show()
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 44| 22| 12| 3| 3| 1|
| 11| 21| 35| 1| 2| 2|
| 22| 12| 66| 2| 1| 3|
+---+---+---+--------+--------+--------+
I have a table similar to following:
+----------+----+--------------+-------------+
| Date|Hour| Weather|Precipitation|
+----------+----+--------------+-------------+
|2013-07-01| 0| null| null|
|2013-07-01| 3| null| null|
|2013-07-01| 6| clear|trace of p...|
|2013-07-01| 9| null| null|
|2013-07-01| 12| null| null|
|2013-07-01| 15| null| null|
|2013-07-01| 18| rain| null|
|2013-07-01| 21| null| null|
|2013-07-02| 0| null| null|
|2013-07-02| 3| null| null|
|2013-07-02| 6| rain|low precip...|
|2013-07-02| 9| null| null|
|2013-07-02| 12| null| null|
|2013-07-02| 15| null| null|
|2013-07-02| 18| null| null|
|2013-07-02| 21| null| null|
+----------+----+--------------+-------------+
The idea is to fill columns Weather and Precipitation with values at 6 and 18 hours and at 6 hours respectfully. Since this table illustrates a DataFrame structure, simple iteration through this seemes to be irrational.
I tried something like this:
//_weather stays for the table mentioned
def fillEmptyCells: Unit = {
val hourIndex = _weather.schema.fieldIndex("Hour")
val dateIndex = _weather.schema.fieldIndex("Date")
val weatherIndex = _weather.schema.fieldIndex("Weather")
val precipitationIndex = _weather.schema.fieldIndex("Precipitation")
val days = _weather.select("Date").distinct().rdd
days.foreach(x => {
val day = _weather.where("Date == $x(0)")
val dayValues = day.where("Hour == 6").first()
val weather = dayValues.getString(weatherIndex)
val precipitation = dayValues.getString(precipitationIndex)
day.rdd.map(y => (_(0), _(1), weather, precipitation))
})
}
However, this ugly piece of code seemes to smell because of iterating through an RDD instead of handling it in a distributed manner. It also has to form a new RDD or DataFrame from pieces what can be problematic (I have no idea how to do this). Is there more elegant and simple way to solve this task?
Assuming that you can easily create a timestamp column by combining Date and Hour, what I would do next is :
convert this timestamp (probably in milliseconds or seconds) into an hourTimestamp : .withColumn("hourTimestamp", $"timestamp" // 3600) ?
create 3 columns corresponding to the different possible hour lags (3,6,9)
coalesce these 3 columns + the original one
Here is the code for Weather (do the same for Precipitation):
val window = org.apache.spark.sql.expressions.Window.orderBy("hourTimestamp")
val weatherUpdate = df
.withColumn("WeatherLag1", lag("Weather", 3).over(window))
.withColumn("WeatherLag2", lag("Weather", 6).over(window))
.withColumn("WeatherLag3", lag("Weather", 9).over(window))
.withColumn("Weather",coalesce($"Weather",$"WeatherLag1",$"WeatherLag2",$"WeatherLag3"))
Scala 2.12 and Spark 2.2.1 here. I have the following code:
myDf.show(5)
myDf.withColumn("rank", myDf("rank") * 10)
myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
When I run this, in the logs I see:
+---------+-----------+----+
|fizz|buzz|rizzrankrid|rank|
+---------+-----------+----+
| 2| 5| 1440370637| 128|
| 2| 5| 2114144780|1352|
| 2| 8| 199559784|3233|
| 2| 5| 1522258372| 895|
| 2| 9| 918480276| 882|
+---------+-----------+----+
And now:
+---------+-----------+-----+
|fizz|buzz|rizzrankrid| rank|
+---------+-----------+-----+
| 2| 5| 1440370637| 1280|
| 2| 5| 2114144780|13520|
| 2| 8| 199559784|32330|
| 2| 5| 1522258372| 8950|
| 2| 9| 918480276| 8820|
+---------+-----------+-----+
So, interesting:
The first withColumn works, transforming each row's rank value by multiplying itself by 10
However the second withColumn fails, which is just adding the current date/time to all rows as a new lastRanOn column
What do I need to do to get the lastRanOn column addition working?
Your example is probably too simple, because modifying rank should also not work.
withColumn does not update DataFrame, it's create a new DataFrame.
So you must do:
// if myDf is a var
myDf.show(5)
myDf = myDf.withColumn("rank", myDf("rank") * 10)
myDf = myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
or for example:
myDf.withColumn("rank", myDf("rank") * 10).withColumn("lastRanOn", current_date()).show(5)
Only then you will have new column added - after reassigning new DataFrame reference
I have the following dataframe:
df.show
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 2|
|2017-05-20| 1|
|2017-06-23| 2|
|2017-06-16| 3|
|2017-06-30| 1|
I want to replace the count values by 0, where it is greater than 1, i.e., the resultant dataframe should be:
+----------+-----+
| createdon|count|
+----------+-----+
|2017-06-28| 1|
|2017-06-17| 0|
|2017-05-20| 1|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| 1|
I tried the following expression:
df.withColumn("count", when(($"count" > 1), 0)).show
but the output was
+----------+--------+
| createdon| count|
+----------+--------+
|2017-06-28| null|
|2017-06-17| 0|
|2017-05-20| null|
|2017-06-23| 0|
|2017-06-16| 0|
|2017-06-30| null|
I am not able to understand, why for the value 1, null is getting displayed and how to overcome that. Can anyone help me?
You need to chain otherwise after when to specify the values where the conditions don't hold; In your case, it would be count column itself:
df.withColumn("count", when(($"count" > 1), 0).otherwise($"count"))
This can be done using udf function too
def replaceWithZero = udf((col: Int) => if(col > 1) 0 else col) //udf function
df.withColumn("count", replaceWithZero($"count")).show(false) //calling udf function
Note : udf functions should always be the choice only when there is no inbuilt functions as it requires serialization and deserialization of column data.