PySpark - Struggling to arrange the data into a specific format

I am working on outputting total deduplicated counts from a pre-aggregated frame, as follows.
I currently have a data frame that displays like so. This is the initial structure, and the point I have gotten to after filtering out unneeded columns.
ID   Source
101  Grape
101  Flower
102  Bee
103  Peach
105  Flower
We can see from the example above that ID 101 is found under both Grape and Flower. I would like to rearrange the data so that the distinct string values from the "Source" column become their own columns, as from there I can perform a groupBy for a specific arrangement of yes's and no's, like so.
ID   Grape  Flower  Bee  Peach
101  Yes    Yes     No   No
102  No     No      Yes  No
103  No     No      No   Yes
Creating this manually for the small example above would be feasible, but I am working with 100m+ rows and need something more succinct.
What I've managed so far is to extract the distinct Source values and arrange them into a list:
import re

dedupeTableColumnNames = dedupeTable.select('SOURCE').distinct().collect()
dedupeTableColumnNamesCleaned = re.findall(r"'([^']*)'", str(dedupeTableColumnNames))
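As an aside, the distinct values can also be read straight out of the collected Rows, which avoids the regex round-trip over their string representations; a minimal sketch, assuming the column is named SOURCE:
# Sketch: pull the SOURCE field from each collected Row directly
dedupeTableColumnNamesCleaned = [row['SOURCE'] for row in dedupeTable.select('SOURCE').distinct().collect()]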

That's just a pivot:
df.groupBy("id").pivot("source").count().show()
+---+------+------+------+------+
| id|Bee |Flower|Grape |Peach |
+---+------+------+------+------+
|103| null| null| null| 1|
|105| null| 1| null| null|
|101| null| 1| 1| null|
|102| 1| null| null| null|
+---+------+------+------+------+
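To get the Yes/No layout described in the question rather than raw counts, the pivoted columns can be mapped through when/otherwise; a minimal sketch, assuming the pivoted frame built above:
from pyspark.sql import functions as F

pivoted = df.groupBy("id").pivot("source").count()
# Turn each pivoted count column into "Yes" (at least one match) or "No" (null)
yes_no = pivoted.select(
    "id",
    *[F.when(F.col(c).isNotNull(), "Yes").otherwise("No").alias(c)
      for c in pivoted.columns if c != "id"]
)
yes_no.show()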

Related

Scala Spark functions like group by, describe() returning incorrect result

I have been using Scala Spark in the IntelliJ IDE to analyze a CSV file with 672,112 records. The file is available at https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding
File name: kiva_loans.csv
I ran the show() command to view a few records and it reads all columns correctly, but when I apply a group by on the column "repayment_interval", it displays values that appear to be data from other columns (a column shift).
The distinct values in the "repayment_interval" column are:
Monthly (More frequent)
irregular
bullet
weekly (less frequent)
For testing purposes, I searched for the values given in the screenshot, put those rows in a separate file, and tried to read that file using Scala Spark. It shows all values in the correct columns, and even the group by returns correct values.
I am also facing this issue with the describe() function.
As shown in the image above, the columns "id" and "funded_amount" are numeric, but I am not sure why describe() on them gives string values for "min" and "max".
The read CSV command is as below:
val kivaloans = spark.read
  //.option("sep", ",")
  .format("com.databricks.spark.csv")
  .option("header", true)
  .option("inferschema", "true")
  .csv("kiva_loans.csv")
printSchema output after adding ".option("multiline","true")": it reads a few rows as headers, as shown in the highlighted yellow color.
It seems there are newline characters in the column data. Hence, set the property multiline to true:
val kivaloans = spark.read.format("com.databricks.spark.csv")
  .option("multiline", "true")
  .option("header", true)
  .option("inferschema", "true")
  .csv("kiva_loans.csv")
The data summary is as follows after setting multiline to true:
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
|summary| id| funded_amount| loan_amount| activity| sector| use| country_code| country| region| currency| partner_id| posted_time| disbursed_time| funded_time| term_in_months| lender_count| tags| borrower_genders| repayment_interval| date|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+
| count| 671205| 671205| 671205| 671205| 671205| 666977| 671197| 671205| 614441| 671186| 657699| 671195| 668808| 622890| 671196| 671199| 499834| 666957| 671191| 671199|
| mean| 993248.5937336581|785.9950611214159|842.3971066961659| null| null| 10000.0| null| null| null| null| 178.20274555550654| 162.01020408163265| 179.12244897959184| 189.3|13.74266332047713|20.588457578299735| 25.68553459119497| 26.4| 26.210526315789473| 27.304347826086957|
| stddev|196611.27542282813|1130.398941057504|1198.660072882945| null| null| NaN| null| null| null| null| 94.24892231613454| 78.65564973356628| 100.70555939905975| 125.87299363372507|8.631922222356161|28.458485403188924| 31.131029407317044| 35.87289875191111| 52.43279244938066| 41.99181173710449|
| min| 653047| 0.0| 25.0|Adult Care|AgricultuTo buy chicken.| ""fajas"" [wove...| 10 boxes of cream| 3x1 purlins| T-shaped brackets| among other prod...| among other item...| and pay for labour"| and cassava to m...| yeast| rice| milk| among other prod...|#Animals, #Biz Du...| #Elderly|
| 25%| 823364| 250.0| 275.0| null| null| 10000.0| null| null| null| null| 126.0| 123.0| 105.0| 87.0| 8.0| 7.0| 8.0| 8.0| 9.0| 6.0|
| 50%| 992996| 450.0| 500.0| null| null| 10000.0| null| null| null| null| 145.0| 144.0| 144.0| 137.0| 13.0| 13.0| 14.0| 15.0| 14.0| 17.0|
| 75%| 1163938| 900.0| 1000.0| null| null| 10000.0| null| null| null| null| 204.0| 177.0| 239.0| 201.0| 14.0| 24.0| 27.0| 31.0| 24.0| 34.0|
| max| 1340339| 100000.0| 100000.0| Wholesale| Wholesale|? provide a safer...| ZW| Zimbabwe| ?ZM?T| baguida| XOF| XOF| Yoro, Yoro| USD| USD| USD|volunteer_pick, v...|volunteer_pick, v...| weekly|volunteer_pick, v...|
+-------+------------------+-----------------+-----------------+----------+-----------+--------------------+--------------------+------------------+------------+------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+--------------------+--------------------+

PySpark: Group by two columns, count the pairs, and divide the average of two different columns

I have a dataframe with several columns, some of which are labeled PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, then count each combination into a column called "count". I also need to take the averages of total_amount and trip_distance and divide them into a column called "trip_rate". The end DF should be:
PULocationID  DOLocationID  count  trip_rate
123           422           1      5.2435
3             27            4      6.6121
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns).
import pyspark.sql.functions as F

new_df = df.groupBy(
    "PULocationID",
    "DOLocationID",
).agg(
    F.count(F.lit(1)).alias("count"),
    F.avg(F.col("total_amount")).alias("avg_amt"),
    F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
    "PULocationID",
    "DOLocationID",
    "count",
    (F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate")
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+
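For reference, the withColumn and drop variant mentioned above would look something like this; a minimal sketch, assuming the same input columns:
import pyspark.sql.functions as F

alt_df = (
    df.groupBy("PULocationID", "DOLocationID")
      .agg(
          F.count(F.lit(1)).alias("count"),
          F.avg("total_amount").alias("avg_amt"),
          F.avg("trip_distance").alias("avg_distance"),
      )
      # Derive the rate, then drop the intermediate averages
      .withColumn("trip_rate", F.col("avg_amt") / F.col("avg_distance"))
      .drop("avg_amt", "avg_distance")
)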

GroupBy with condition on aggregate Spark/Scala

I have a dataframe like this:
+--------------------+-------------------+------------------+------------------------+---+
| ID_VISITE_CALCULE| TAG_TS_TO_TS|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| RK|
+--------------------+-------------------+------------------+------------------------+---+
|GA1.2.1023040287....|2019-04-23 11:24:19| dupont| null| 1|
|GA1.2.1023040287....|2019-04-23 11:24:19| durand| null| 2|
|GA1.2.105243141.1...|2019-04-23 11:21:01| null| null| 1|
|GA1.2.1061963529....|2019-04-23 11:12:19| null| null| 1|
|GA1.2.1065635192....|2019-04-23 11:07:14| antoni| null| 1|
|GA1.2.1074357108....|2019-04-23 11:11:34| lang| null| 1|
|GA1.2.1074357108....|2019-04-23 11:12:37| lang| null| 2|
|GA1.2.1075803022....|2019-04-23 11:28:38| cavail| null| 1|
|GA1.2.1080137035....|2019-04-23 11:20:00| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| linare| null| 2|
|GA1.2.1111218536....|2019-04-23 11:28:43| null| null| 1|
|GA1.2.1111218536....|2019-04-23 11:32:26| null| null| 2|
|GA1.2.1111570355....|2019-04-23 11:07:00| null| null| 1|
+--------------------+-------------------+------------------+------------------------+---+
I'm trying to apply rules to aggregate by ID_VISITE_CALCULE and keep only one row per ID.
For each ID (a group), I wish to:
get the first timestamp of the group and store it in a START column
get the last timestamp of the group and store it in an END column
test whether EXTERNAL_PERSON_ID is the same for the whole group: if it is and it is NULL, then I write NULL; if it is and it is a name, then I write the name; finally, if there are different values in the group, I write UNDEFINED
apply exactly the same rules to the column EXTERNAL_ORGANISATION_ID
RESULT:
+--------------------+------------------+------------------------+-------------------+-------------------+
| ID_VISITE_CALCULE|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| START| END|
+--------------------+------------------+------------------------+-------------------+-------------------+
|GA1.2.1023040287....| undefined| null|2019-04-23 11:24:19|2019-04-23 11:24:19|
|GA1.2.105243141.1...| null| null|2019-04-23 11:21:01|2019-04-23 11:21:01|
|GA1.2.1061963529....| null| null|2019-04-23 11:12:19|2019-04-23 11:12:19|
|GA1.2.1065635192....| antoni| null|2019-04-23 11:07:14|2019-04-23 11:07:14|
|GA1.2.1074357108....| lang| null|2019-04-23 11:11:34|2019-04-23 11:12:37|
|GA1.2.1075803022....| cavail| null|2019-04-23 11:28:38|2019-04-23 11:28:38|
|GA1.2.1080137035....| null| null|2019-04-23 11:20:00|2019-04-23 11:20:00|
|GA1.2.1081805479....| undefined| null|2019-04-23 11:10:49|2019-04-23 11:10:49|
|GA1.2.1111218536....| null| null|2019-04-23 11:28:43|2019-04-23 11:32:26|
|GA1.2.1111570355....| null| null|2019-04-23 11:07:00|2019-04-23 11:07:00|
+--------------------+------------------+------------------------+-------------------+-------------------+
In my example, I only have 2 lines for a group at most, but in the real dataset I can have several hundred lines in a group.
Thank you for your kind assistance.
All of this can be done in a single groupBy call; however, for the (slight) performance benefit and for readability of the code, I'd suggest splitting it into 2 calls:
import org.apache.spark.sql.functions.{col, size, collect_set, max, min, when, lit}

// First pass: one row per ID, with the timestamp bounds and the set of distinct values per column
val res1DF = df.groupBy(col("ID_VISITE_CALCULE")).agg(
  min(col("TAG_TS_TO_TS")).alias("START"),
  max(col("TAG_TS_TO_TS")).alias("END"),
  collect_set(col("EXTERNAL_PERSON_ID")).alias("EXTERNAL_PERSON_ID"),
  collect_set(col("EXTERNAL_ORGANISATION_ID")).alias("EXTERNAL_ORGANISATION_ID")
)

// Second pass: collapse each set to its single value, or UNDEFINED when the group is ambiguous
val res2DF = res1DF
  .withColumn("EXTERNAL_PERSON_ID",
    when(size(col("EXTERNAL_PERSON_ID")) > 1, lit("UNDEFINED"))
      .otherwise(col("EXTERNAL_PERSON_ID").getItem(0)))
  .withColumn("EXTERNAL_ORGANISATION_ID",
    when(size(col("EXTERNAL_ORGANISATION_ID")) > 1, lit("UNDEFINED"))
      .otherwise(col("EXTERNAL_ORGANISATION_ID").getItem(0)))
The getItem method handles most of the conditions in the background: if the set of values is empty (collect_set ignores nulls), it will return null, and if there is just a single value, it will return that value.
It would be good if you showed some code / sample data from which the dataframe is built.
Assuming you have a dataframe named tableDf, here is a Spark SQL solution:
tableDf.createOrReplaceTempView("input_table")

val sqlStr = """
select ID_VISITE_CALCULE,
       (case when count(distinct person_id_calculation) > 1 then "undefined"
             when count(distinct person_id_calculation) = 1 and
                  max(person_id_calculation) = "noNull" then ""
             else max(person_id_calculation) end) as EXTERNAL_PERSON_ID,
       -- do the same for EXTERNAL_ORGANISATION_ID
       max(start_v) as start_v, max(last_v) as last_v
from
  (select ID_VISITE_CALCULE,
          (case
             when nvl(EXTERNAL_PERSON_ID, "noNull") =
                  lag(EXTERNAL_PERSON_ID, 1, "noNull") over(partition by
                    ID_VISITE_CALCULE order by TAG_TS_TO_TS) then EXTERNAL_PERSON_ID
             else "undefined" end) as person_id_calculation,
          -- same calculation for EXTERNAL_ORGANISATION_ID
          first(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
            TAG_TS_TO_TS) as start_v,
          last(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
            TAG_TS_TO_TS) as last_v
   from input_table) a
group by 1
"""
val resultDf = spark.sql(sqlStr)

Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.
For example,
Suppose we have the following dataframe:
Category Color Number Letter
1 Red 4 A
1 Yellow Null B
3 Green 8 C
2 Blue Null A
1 Green 9 A
3 Green 8 B
3 Yellow Null C
2 Blue 9 B
3 Blue 8 B
1 Blue Null Null
1 Red 7 C
2 Green Null C
1 Yellow 7 Null
3 Red Null B
Now we want to group by Category, then Color, and then find the size of the grouping, the count of non-null Number values, the total size of Number, the mean of Number, the mode of Number, and the corresponding mode count. For Letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean, since this is a string).
So the output would ideally be:
Category Color CountNumber(Non-Nulls) Size MeanNumber ModeNumber ModeCountNumber CountLetter(Non-Nulls) ModeLetter ModeCountLetter
1 Red 2 2 5.5 4 (or 7)
1 Yellow 1 2 7 7
1 Green 1 1 9 9
1 Blue 1 1 - -
2 Blue 1 2 9 9 etc
2 Green - 1 - -
3 Green 2 2 8 8
3 Yellow - 1 - -
3 Blue 1 1 8 8
3 Red - 1 - -
This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.
Thanks.
As far as I know, there's no simple way to compute the mode: you have to count the occurrences of each value and then join that result with the maximum count (per key) of that result. The rest of the computations are rather straightforward:
// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()

// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base")
  .join(numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
    $"base.Category" === $"max.Category" and
    $"base.Color" === $"max.Color" and
    $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Number", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
  .where($"ModeNumber".isNotNull)

// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
  count("Color") as "Size",         // counting a key column -> includes nulls
  count("Number") as "CountNumber", // does not include nulls
  mean("Number") as "MeanNumber"
).join(modeNumbers, Seq("Category", "Color"), "left")

result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// | 3|Yellow| 1| 0| null| null| null|
// | 1| Green| 1| 1| 9.0| 9| 1|
// | 1| Red| 2| 2| 5.5| 7| 1|
// | 2| Green| 1| 0| null| null| null|
// | 3| Blue| 1| 1| 8.0| 8| 1|
// | 1|Yellow| 2| 1| 7.0| 7| 1|
// | 2| Blue| 2| 1| 9.0| 9| 1|
// | 3| Green| 2| 2| 8.0| 8| 2|
// | 1| Blue| 1| 0| null| null| null|
// | 3| Red| 1| 0| null| null| null|
// +--------+------+----+-----------+----------+----------+---------------+
As you can imagine - this might be slow, as it has 4 groupBys and two joins - all requiring shuffles...
As for the Letter column statistics - I'm afraid you'll have to repeat this for that column separately and add another join.

Spark Dataframe sliding window over pair of rows

I have an event log in CSV consisting of three columns: timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
val eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0), (20160102, 3, 1), (20160201, 4, 1), (20160202, 2, 0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
+---------+-------+------+
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101| 1| 0|
| 20160102| 3| 1|
| 20160201| 4| 1|
| 20160202| 2| 0|
+---------+-------+------+
The desired end result would be:
+---------+-------+------+-----------+
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101| 1| 0| 2|
| 20160102| 3| 1| 4|
| 20160201| 4| 1| Nil|
| 20160202| 2| 0| Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using the window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For a true sliding window (like a sliding average) you can use the rowsBetween or rangeBetween clauses of the window definition, but that is not really required here. Nevertheless, example usage could look something like this:
val w2 = Window.partitionBy("userId")
  .orderBy(asc("timestamp"))
  .rowsBetween(-1, 0)

avg($"foo").over(w2)